ABSTRACT


Current methods of information retrieval (IR) are adequate for everyday search
needs, but they are not appropriate for many military and industrial tasks. The underlying
mechanism of typical search methods is based upon keyword matching, which has
demonstrated very poor performance over highly technical requirements documents
found within the field of acquisitions. Instead of matching keywords, IR methods that
understand the meaning of the words in a query are needed to provide the necessary
performance over these types of documents; this is known as semantic search.

This work utilizes sound software engineering practices to specify, design, and
develop a modular framework to aid in the design, testing, and development of new
semantic search methods and IR techniques, in general. The development of the Modular
Search Engine framework is documented in its entirety, from user needs analysis to the
production of a full application programming interface.

By exploiting the powerful techniques of polymorphism and object-oriented
programming in the Java programming language, users are able to design new IR
techniques that will function seamlessly within the framework.

Finally, a reference implementation is provided as a proof-of-concept to
demonstrate  the  capabilities  and  usefulness  of  the  framework  design. 
I. INTRODUCTION


A.  BACKGROUND 

For many users’ needs, the advent of Google has trivialized the problem of
finding relevant documents on the Internet. Prior to Google, the search task was
accomplished by performing a simple keyword search, which finds pages that contain the
words in the query and rank orders them according to how strongly those words matched.
Google’s revolution came not by changing the fundamentals, as the pages returned are
still those that match the keywords in the query, but instead by changing the order in
which the returned pages are presented. Google evaluates the returned pages according
to the PageRank algorithm and then presents those pages in order of decreasing
PageRank value.

Thus, the innovation behind Google is in the PageRank algorithm. Simply put,
the algorithm ranks pages according to sociological importance by observing the number
of hyperlinks that point to each page. The more links that point to a particular page, the
higher that page is in the “society.” Additionally, some pages are given extra authority
based upon the number and rank of the pages that point to them. Therefore, if several
pages with high authority all refer to a particular page, it will be ranked higher than
another page that has only low-ranking pages pointing to it [1]. PageRank is essentially
analogous to the stereotypical high-school social popularity status: If you can become
associated with a “cool kid,” then your social status will be elevated accordingly.
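
The ranking idea can be illustrated with a simplified power-iteration sketch. It is not Google's production algorithm: the damping factor of 0.85, the iteration count, the four-page link graph, and the class name PageRankSketch are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Simplified PageRank by power iteration (illustrative sketch only).
class PageRankSketch {

    // links[i] holds the indices of the pages that page i links to.
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                 // start with equal rank
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n); // baseline share for every page
            for (int page = 0; page < n; page++) {
                int[] targets = links[page];
                if (targets.length == 0) {
                    // A page with no outgoing links spreads its rank evenly.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[page] / n;
                    }
                } else {
                    // Otherwise its rank is split among the pages it links to.
                    for (int target : targets) {
                        next[target] += damping * rank[page] / targets.length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Assumed toy graph: pages 0, 1, and 3 link to page 2; page 2 links to page 0.
        int[][] links = { {2}, {2}, {0}, {2} };
        System.out.println(Arrays.toString(rank(links, 0.85, 100)));
    }
}
```

In this toy graph, page 2 collects the most inbound links and ends up with the highest rank, and page 0 outranks pages 1 and 3 because its single inbound link comes from the highly ranked page 2.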

B.  MOTIVATION 

Despite the fact that Google works well for most search tasks, for many military
and industrial tasks, popularity is not a sufficient metric. Consider a software engineer
who is tasked with developing a sophisticated system. He separates his design into
subcomponents designed to achieve particular tasks that contribute to the operation of the
whole. Before he sets off to start building each subcomponent from scratch, he first
searches his company’s database to find out if any subcomponent (or part thereof)
already exists in order to not duplicate effort.

So, he searches over the database of requirements documents with a particular
search query, and if he is extremely lucky, the best component in the database that meets
his needs will have been described with the same set of words in his query. Chances are,
however, that those particular words were not used to describe the existing component,
but rather a different set of words with the exact same meaning. In this case, the search
will not return what he needs, regardless of the popularity of the documents returned: If
the keywords are incorrect, he will never find the component that he is looking for. He
then resorts to altering his set of keywords with synonyms, in hopes of choosing the
particular words that were used to describe the relevant system in the database, a
particularly time-consuming and frustrating effort.

The  problem  described  above  is  the  semantic  search  problem,  and  it  is  a  particular 
issue  in  Department  of  Defense  (DoD)  acquisitions.  In  August  2006,  Program  Executive 
Officer of Integrated Warfare Systems (PEO-IWS) established the Software Hardware
Asset Reuse Enterprise (SHARE) repository to enable the reuse of combat system
software and related assets [2]. In order to make effective use of the SHARE repository,
the  DoD  needs  an  effective  solution  to  the  problem  of  semantic  search. 

C.  OBJECTIVES 

The  objectives  of  this  thesis  are  to  utilize  sound  software  engineering  practices  to 
specify, design, and develop a modular framework for developing, implementing, and
testing new semantic search methods and information retrieval (IR) techniques, in
general.  These  objectives  shall  be  accomplished  through  the  following: 

• Thorough system specification and design using UML and other software
engineering  practices. 

• Development of a modular, object-oriented Java package whose
components can be used to build a fully functional search engine
consisting of one or more independent IR modules. The addition of a
single IR module should not incur a large integration effort as measured
by the number of classes and methods that need to be implemented.
Additionally, the framework will incorporate basic management
functionality for use by administrators, such as adding and deleting
documents from a corpus.

• Demonstrate the modular framework by developing a reference
implementation that consists of at least two IR modules whose results are
combined to produce a single list of results to the user.

D.  SCOPE 

The  scope  of  this  thesis  focuses  on  the  design  of  a  modular  framework  that  allows 
multiple  IR  methods  to  run  simultaneously  on  a  selected  corpus  of  data  with  each  method 
returning a list of search results. The framework also provides for the development of
methods to combine the lists returned from each IR method into a single list that is
returned to the user. The scope of this thesis does not include the development of a new
method  for  IR. 

E.  THESIS  ORGANIZATION 

Chapter II establishes the system and user requirements necessary to design a
comprehensive  and  modular  framework  for  implementing  multiple  IR  techniques  within 
a  single  search  engine.  A  detailed  use  case  analysis  is  performed. 

Chapter  III  formalizes  the  requirement  specifications  into  an  architectural  design 
by decomposing the system into a set of subsystems. The use cases from Chapter II are
expanded  and  developed  in  detail. 

Chapter IV describes and demonstrates the functionality of a reference
implementation;  in  addition,  this  chapter  describes  an  evaluation  metric  and  demonstrates 
how  to  apply  the  measure. 

Chapter  V  contains  a  summary  and  recommendations  for  future  work. 

The  Appendix  provides  a  UML  reference  key  to  the  figures  in  Chapters  II  and  III. 
II. VISION DOCUMENT


A.  INTRODUCTION 

1.  Purpose  of  the  Vision  Document 

This chapter provides the foundation, background, and reference for all future,
more detailed, development. Here, the high-level user needs are gathered, analyzed, and
defined to identify the required features needed for a fully functional Modular Search
Engine.


2.  Framework  Overview 

The Modular Search Engine provides the framework for future design,
development, testing, implementation, and deployment of IR methods. Developers need
only adhere to the design requirements, inherited via abstract superclasses, in order to
have  a  new  IR  technique  integrate  seamlessly  into  the  Modular  Search  Engine. 

B.  USER  DESCRIPTION 

1.  User  Demographics 

The primary users of the Modular Search Engine framework are any student or
researcher looking to develop and test new methods of IR and/or metasearch.
Specifically, Draeger used the Modular Search Engine framework to implement a new
semantic search technique to help solve the problems of searching over requirements
documents [3].

Additionally,  the  Modular  Search  Engine  framework  can  be  used  to  develop  fully 
functional  applications  for  end-users  needing  to  conduct  searches  over  text  corpora.  Such 
applications would require administrative control and functionality to update and
maintain  the  corpora. 
2. User Profiles


Students and IR researchers at NPS and other academic universities will need to
be familiar with the Java programming language in order to use the Modular Search
Engine  framework. 

End-users, for whom applications have been built using the Modular Search
Engine framework, need not have any specific knowledge of the inner workings of the
application. Such users only need basic computer knowledge to launch the application
and  conduct  searches  over  the  corpus  for  which  the  application  was  designed. 

3.  User  Environment 

Users  of  the  framework  will  need  a  computer  system  that  enables  development  in 
the Java programming language. While not mandatory, a development environment such
as Eclipse or NetBeans is recommended. At minimum, users will need a text editor and a
current version of the Java SE Development Kit provided by Sun Microsystems in order
to  write,  build,  and  run  their  applications. 

End-user  applications  developed  using  the  Modular  Search  Engine  framework  can 
be  run  on  any  computer  operating  system  utilizing  a  current  Java  Runtime  Environment, 
also  provided  by  Sun  Microsystems. 

4.  Key  User  Needs 

When conducting research in this field, comparing different IR methods against
one another to determine the method with the best performance is important. The
Modular  Search  Engine  framework  provides  the  architecture  and  data  structures  that  each 
IR  method  must  utilize  to  simplify  such  comparisons. 

One additional and important area of study in the field of IR is known as
metasearch. Metasearch is the process of fusing or merging the ranked lists of documents
returned from different methods or systems in order to produce a combined list whose
quality (as measured via the performance metrics mentioned above) is greater than or
equal to any of the lists from which it was created [4]. Given the ability to improve the
quality of results returned to the user and the modular nature of the framework,
metasearch has been included in the design of the Modular Search Engine from the
ground up, and users are provided with the structure in which to build their metasearch
techniques.
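
As an illustration of what fusing ranked lists means in practice, the sketch below applies the Borda count, a well-known voting-style fusion rule. It is shown only as an example; the text does not state which fusion rule the framework uses, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Borda-count fusion of ranked lists of document IDs (illustrative sketch).
class BordaFusion {

    // Each inner list is one method's ranking, best document first.
    static List<Integer> fuse(List<List<Integer>> rankings) {
        Map<Integer, Integer> points = new LinkedHashMap<>();
        for (List<Integer> ranking : rankings) {
            int size = ranking.size();
            for (int position = 0; position < size; position++) {
                // The top document earns 'size' points, the next one size - 1, and so on.
                points.merge(ranking.get(position), size - position, Integer::sum);
            }
        }
        List<Integer> fused = new ArrayList<>(points.keySet());
        // Sort by descending points; the sort is stable, so ties keep first-seen order.
        fused.sort((a, b) -> Integer.compare(points.get(b), points.get(a)));
        return fused;
    }

    public static void main(String[] args) {
        List<List<Integer>> rankings = Arrays.asList(
                Arrays.asList(7, 3, 9),   // ranking from a first method
                Arrays.asList(3, 5, 7));  // ranking from a second method
        System.out.println(fuse(rankings));
    }
}
```

Document 3 is ranked highly by both methods and therefore rises to the top of the fused list, even though the first method preferred document 7.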


5.  Alternatives 

Each student or IR researcher is certainly free to develop, test, and implement
new IR techniques without the use of the Modular Search Engine framework. They
would, however, be required to spend valuable time implementing the entire
infrastructure themselves instead of on the development of the IR method. Additionally,
it is highly unlikely that any two IR techniques developed by different authors would
work cohesively in the same system without extensive modifications to one or both
authors’  source  code. 

C.  FRAMEWORK  OVERVIEW 

1.  Framework  Perspective 

The Modular Search Engine framework’s architecture allows multiple IR
techniques to run simultaneously on a user’s query over a selected corpus of documents.
The architecture then combines the results of each into a single ranked list that is returned
to the user. The framework is designed such that each IR technique, known within the
framework  as  a  Search  Module,  need  not  be  aware  of  any  other  Search  Module  within  the 
Modular  Search  Engine. 

2.  Framework  Position  Statement 

IR researchers can benefit from a common framework in which to develop and
test new IR techniques. The Modular Search Engine framework provides all of the
overhead and design constraints necessary to streamline design efforts into the
development of new IR techniques. Additionally, the framework provides sufficient
structure to develop a fully functional end-user application for searching over given data
corpora. 


3.  Assumptions  and  Dependencies 

The Modular Search Engine framework is written in the Java programming
language and applications developed with the framework can be run on any platform with
the current Java Runtime Environment installed. The data, over which a Modular Search
Engine application may conduct searches, is independent of the framework itself;
however, the framework provides the necessary classes into which the data must be
converted  for  use  within  the  application. 

D.  FRAMEWORK  FEATURES 

1.  Data  Access  and  Management 

a.  Document 

The  basic  data  element  within  the  Modular  Search  Engine  framework  is  a 
document.  At  a  minimum  a  document  consists  of  a  unique  identification  number,  known 
as a document ID, and a body of text. However, a document may contain much more
information, e.g., an author, bibliographical information, date written, etc. For this
reason, this basic document model will likely need to be extended in order to capture the
additional information that may exist.
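
A minimal sketch of this document model follows. The class name AuthoredDocument, the field names, and the accessor names are hypothetical; only the idea of an ID plus a body of text, extended by subclasses, comes from the description above.

```java
// Minimal document model: a unique ID plus a body of text (illustrative sketch).
class Document {
    private final int documentId;
    private final String body;

    Document(int documentId, String body) {
        this.documentId = documentId;
        this.body = body;
    }

    int getDocumentId() { return documentId; }
    String getBody() { return body; }
}

// Hypothetical extension that captures additional information about a document.
class AuthoredDocument extends Document {
    private final String author;
    private final String dateWritten;

    AuthoredDocument(int documentId, String body, String author, String dateWritten) {
        super(documentId, body);
        this.author = author;
        this.dateWritten = dateWritten;
    }

    String getAuthor() { return author; }
    String getDateWritten() { return dateWritten; }
}

class DocumentDemo {
    public static void main(String[] args) {
        // Code written against Document keeps working with the richer subclass.
        Document doc = new AuthoredDocument(42, "The system shall track targets.",
                "A. Author", "2009-06-01");
        System.out.println(doc.getDocumentId() + ": " + doc.getBody());
    }
}
```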

b.  Corpus 

A collection of documents that have a similar underlying structure constitutes
a corpus. In the realm of IR research, a corpus is usually a fixed set of documents over
which IR techniques are tested and compared against one another. To this end, read
access to the data is the minimum capability required to access the data and perform these
types of operations. However, all corpora need not remain static. As such, the Modular
Search Engine framework is designed with this in mind and includes the functionality to
add and delete documents from a corpus. Such functions are expected to be used by an
administrator  needing  to  maintain  the  data  in  a  given  corpus. 
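
The read, add, and delete capabilities can be sketched as follows. This is an assumption-laden illustration: the Corpus class shown here keeps everything in memory, and its method names and boolean return convention are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory corpus keyed by document ID (illustrative sketch; no disk persistence).
class Corpus {
    private final Map<Integer, String> bodiesById = new LinkedHashMap<>();

    // Read access: the minimum capability every corpus must offer.
    String read(int documentId) { return bodiesById.get(documentId); }

    int size() { return bodiesById.size(); }

    // Administrative operation: returns false if the ID is already in use.
    boolean add(int documentId, String body) {
        if (bodiesById.containsKey(documentId)) {
            return false;
        }
        bodiesById.put(documentId, body);
        return true;
    }

    // Administrative operation: returns false if no such document exists.
    boolean delete(int documentId) {
        if (!bodiesById.containsKey(documentId)) {
            return false;
        }
        bodiesById.remove(documentId);
        return true;
    }

    public static void main(String[] args) {
        Corpus corpus = new Corpus();
        corpus.add(1, "Radar tracking requirements");
        corpus.add(2, "Sonar processing requirements");
        corpus.delete(1);
        System.out.println(corpus.size() + " document(s) remain");
    }
}
```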

2.  Resource  Access  and  Management 

a. Hard Disk Access

In general, IR techniques do not read through an entire corpus of
documents on the hard disk each time they perform a search. Instead, they each create an
internal representation of the corpus, called an index, which each uses to conduct
searches. Accordingly, each IR technique is expected to store its respective index on the
hard disk for subsequent access. This use of hard disk space will save significant
amounts of time and resources by preventing each technique from having to re-build its
index  from  the  original  corpus  every  time  the  system  is  launched. 
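
One way an IR technique might store and reload its index is sketched below using standard Java object serialization. The inverted-index layout, the IndexStore class, and the temporary file are assumptions made for illustration; the text does not prescribe an index format.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Builds a tiny inverted index and stores it on disk (illustrative sketch).
class IndexStore {

    // Maps each lower-cased term to the IDs of the documents containing it.
    static HashMap<String, TreeSet<Integer>> build(Map<Integer, String> corpus) {
        HashMap<String, TreeSet<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> doc : corpus.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    static void save(HashMap<String, TreeSet<Integer>> index, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(index);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, TreeSet<Integer>> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, TreeSet<Integer>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Integer, String> corpus = new HashMap<>();
        corpus.put(1, "Radar tracking requirements");
        corpus.put(2, "Sonar tracking software");
        File file = File.createTempFile("index", ".ser");
        file.deleteOnExit();
        save(build(corpus), file);                      // done once, after the index is built
        System.out.println(load(file).get("tracking")); // done at each later launch
    }
}
```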

b.  Threading 

The Modular Search Engine framework has adopted the principle that no
operation performed by any individual IR technique shall be forced to wait on the
operations of another IR technique. As such, the framework has been designed to
maximize the use of threading, and therefore all operations performed by individual IR
techniques  shall  be  run  by  independent  threads. 

c.  Heap  Space 

Most IR techniques require large amounts of working memory to function
and even more to be efficient at returning quality results to the user in a timely manner.
By default the Java Runtime Environment allocates an initial 32 MB to the heap and
allows it to grow to a maximum of 128 MB. This, unfortunately, is not likely to be
enough memory for the Modular Search Engine framework to perform efficiently,
especially as multiple IR techniques are added to a single system. As such, when running
a Modular Search Engine application, it is recommended to use the maximum amount of
memory  that  a  given  computer  will  allow  the  Java  Runtime  Environment  to  use. 
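
As a sketch of how an application could check the heap it was actually given, the following uses the standard java.lang.Runtime calls. The HeapReport class and the jar name in the comment are hypothetical; the heap sizes in the example command are arbitrary.

```java
// Reports the heap ceiling the running JVM was given (illustrative sketch).
// The ceiling is commonly set at launch with the -Xms (initial) and -Xmx (maximum)
// options, for example:  java -Xms512m -Xmx2g -jar search-app.jar
// (the jar name and the sizes are hypothetical).
class HeapReport {

    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    static long usedHeapMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap: " + maxHeapMegabytes() + " MB");
        System.out.println("Heap in use:  " + usedHeapMegabytes() + " MB");
    }
}
```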

E.  USE  CASE 

Use case scenarios are a critical initial step in determining the requirements of a
system by analyzing the scenarios in which actors will interact with a system and how
that system should respond to the actors’ actions [5]. The use cases identified in this
section will become the primary functions of the Modular Search Engine framework and
will be developed in detail throughout Chapter III. Figure 1 is the use case diagram for
the Modular Search Engine framework; below the figure, each of the seven use case
scenarios  is  described  in  detail. 


Figure  1.  Use  Case  Diagram 
1. Add Document


Use  case:  UC-1  Add  Document 

Primary Actor: Administrator

Stakeholders  and  Interests: 

• Administrator wants to add a document into a corpus so the
document  can  be  included  in  search  queries  by  the  end-user. 

Entry  conditions: 

•  Administrator’s  application  is  running. 

•  The  corpus  is  accessible  for  writing. 

•  Document  object  is  created  in  system  memory. 

Exit  conditions: 

• Document successfully added to the corpus in memory and on
disk.

•  Document  successfully  added  to  each  IR  technique  in  the  system. 

Flow  of  events: 

1.  Administrator  identifies  the  document  to  be  added. 

2.  The  document  is  added  to  the  corpus  on  disk  and  in  memory. 

3.  The  document  is  added  to  each  IR  technique. 

Special  Considerations: 

1. After the addition of a document into a corpus, the index models
for each IR technique will need to be updated/re-built.

2. Each IR technique shall report to the system whether the document was
successfully added.

3. If any IR technique was not successful in adding the document,
then the system as a whole is considered to have failed to add the
document.

4. If the document fails to be added to the corpus in step 2 of the flow
of events, above, then the failure is immediately returned to the
system, and attempts to add the document to the system’s IR
methods are abandoned.
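
The success and failure rules of UC-1 can be sketched as follows. The IndexingModule interface and the AddDocumentFlow class are hypothetical stand-ins, and a duplicate document ID is used as an assumed example of a corpus failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an IR technique that can accept new documents.
interface IndexingModule {
    boolean addDocument(int documentId, String body);
}

// Sketch of the UC-1 flow: corpus first, then every IR technique.
class AddDocumentFlow {
    private final Map<Integer, String> corpus = new HashMap<>();
    private final List<IndexingModule> modules = new ArrayList<>();

    void register(IndexingModule module) { modules.add(module); }

    boolean addDocument(int documentId, String body) {
        // Step 2: add to the corpus; on failure, stop before touching any module.
        if (corpus.containsKey(documentId)) {
            return false;
        }
        corpus.put(documentId, body);

        // Step 3: every module is asked, and one failure fails the whole operation.
        boolean allSucceeded = true;
        for (IndexingModule module : modules) {
            if (!module.addDocument(documentId, body)) {
                allSucceeded = false;
            }
        }
        return allSucceeded;
    }

    public static void main(String[] args) {
        AddDocumentFlow flow = new AddDocumentFlow();
        flow.register((id, body) -> true);            // a module that always succeeds
        System.out.println(flow.addDocument(1, "Radar requirements"));
        System.out.println(flow.addDocument(1, "Duplicate ID, rejected by the corpus"));
    }
}
```

The sketch does not roll back the corpus when a module fails; the use case does not specify that behavior, so it is left open here.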
2. Delete Document


Use  case:  UC-2  Delete  Document 

Primary Actor: Administrator

Stakeholders  and  Interests: 

•  Administrator  wants  to  delete  a  document  from  a  corpus  so  that  the 
document  is  no  longer  included  in  search  queries  by  the  end-user. 

Entry  conditions: 

•  Administrator’s  application  is  running. 

•  The  corpus  is  accessible  for  writing. 

•  The  document  ID  of  the  document  to  be  deleted  is  known. 

Exit  conditions: 

• Document successfully deleted from the corpus in memory and on
disk.

• Document successfully deleted from each IR technique in the
system.

Flow  of  events: 

1. Administrator identifies the document to be deleted.

2. The document is deleted from the corpus on disk and in
memory.

3.  The  document  is  deleted  from  each  IR  technique. 

Special  Considerations: 

1. After the deletion of a document from a corpus, the index models
for each IR technique will need to be updated/re-built.

2. Each IR technique shall report to the system whether the document was
successfully deleted.

3. If any IR technique was not successful in deleting the document,
then the system as a whole is considered to have failed to delete
the document.

4. If the document fails to be deleted from the corpus in step 2 of the
flow of events, above, then the failure is immediately returned to
the system, and attempts to delete the document from the system’s
IR methods are abandoned.
3. Build Index


Use  case:  UC-3  Build  Index 

Primary  Actors:  Administrator  &  Researcher 

Stakeholders  and  Interests: 

• Administrator or researcher wants each IR technique to build its
respective  index  of  the  system  corpus. 

Entry  conditions: 

•  Administrator  or  researcher’s  application  is  running. 

•  The  corpus  is  accessible  for  reading. 

Exit  conditions: 

• Each IR technique in the system has built its respective index of
the corpus.

Flow  of  events: 

1. Administrator or researcher provides the necessary instruction to
the  system. 

2.  Each  IR  technique  builds  its  respective  index  of  the  corpus. 

Special  Considerations: 

1. This functionality is designed to be optimized at the level of each
IR technique so that unnecessary work is not performed. For
example, if there has not been a change to the corpus, then there
should be no need to build a new index. If an individual search
technique is instructed to build a new index in this case, then it
should recognize that no actual change has been made and should
not spend the computer’s resources to build a new index that is
identical to the current index.

2. Each IR technique shall report to the system whether the index was
successfully built.

3. If any IR technique was not successful in building its index, then
the system as a whole is considered to have failed the operation.
4. Force Build Index


Use  case:  UC-4  Force  Build  Index 

Primary  Actors:  Administrator  &  Researcher 

Stakeholders  and  Interests: 

• Administrator or researcher wants to force each IR technique to
build  its  respective  index  of  the  system  corpus. 

Entry  conditions: 

•  Administrator  or  researcher’s  application  is  running. 

•  The  corpus  is  accessible  for  reading. 

Exit  conditions: 

• Each IR technique in the system has forcibly built its respective
index of the corpus.

Flow  of  events: 

1. Administrator or researcher provides the necessary instruction to
the system.

2. Each IR technique forcibly builds its respective index of the
corpus.

Special  Considerations: 

1. This use case is the complement to UC-3. It is designed to ensure
that each IR technique in the system builds a new index of the
corpus.

2. Each IR technique shall report to the system whether the index was
successfully built.

3. If any IR technique was not successful in building its index, then
the system as a whole is considered to have failed the operation.
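
The difference between UC-3 and UC-4 can be sketched for a single IR technique as follows. The version counter used to detect corpus changes is an assumption; the use cases only require that an unforced build skip work when nothing has changed. All names are hypothetical.

```java
// Sketch of UC-3 versus UC-4 for one IR technique (all names are hypothetical).
class IndexedModule {
    private long corpusVersion = 0;        // bumped whenever the corpus changes
    private long indexedVersion = -1;      // version the current index was built from
    private int buildCount = 0;            // how many real builds have happened

    void corpusChanged() { corpusVersion++; }

    // UC-3: build only if the corpus changed since the last build.
    boolean buildIndex() {
        if (indexedVersion == corpusVersion) {
            return true;                   // index is current; nothing to do
        }
        return forceBuildIndex();
    }

    // UC-4: always rebuild, even if nothing changed.
    boolean forceBuildIndex() {
        buildCount++;                      // placeholder for the real indexing work
        indexedVersion = corpusVersion;
        return true;
    }

    int getBuildCount() { return buildCount; }

    public static void main(String[] args) {
        IndexedModule module = new IndexedModule();
        module.buildIndex();               // first build does real work
        module.buildIndex();               // skipped: corpus unchanged
        module.forceBuildIndex();          // rebuilt regardless
        System.out.println("Real builds: " + module.getBuildCount());
    }
}
```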
5. Ready Check


Use  case:  UC-5  Ready  Check 

Primary Actors: End-user & Researcher

Stakeholders  and  Interests: 

• End-user or researcher wants to ensure that each IR method in the
system  is  ready  to  receive  a  search  query. 

Entry  conditions: 

•  The  end-user  or  researcher’s  application  is  running. 

Exit  conditions: 

•  Each  IR  method  in  the  system  has  returned  its  ready  status. 

Flow  of  events: 

1. End-user or researcher requests a ready check of the system.

2.  Each  individual  IR  method  returns  its  ready  status. 

Special  Considerations: 

1. If any one of the individual IR methods is not ready, then the
system’s  status,  as  a  whole,  is  returned  as  not  ready. 
6. Single Query Search


Use  case:  UC-6  Single  Query  Search 

Primary Actors: End-user & Researcher

Stakeholders  and  Interests: 

• End-user or researcher wants to perform a single query search of
the  corpus. 

Entry  conditions: 

•  The  end-user  or  researcher’s  application  is  running. 

•  The  system  is  ready  as  described  in  UC-5. 

Exit  conditions: 

•  The  system  has  returned  the  results  of  the  single  query  search. 

Flow  of  events: 

1. End-user or researcher submits a single query to the system.

2.  Each  individual  IR  technique  in  the  system  performs  a  search  using 
the  provided  query  and  returns  its  results. 

3. All of the results returned from the individual IR methods are
combined to return a single set of results to the end-user or researcher.

Special  Considerations: 

None. 
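
The ready check of UC-5 and the search flow of UC-6 can be sketched together. The QueryModule interface, the SingleQueryFlow class, and the stub helper are hypothetical, and the de-duplicating concatenation used to combine results is chosen only for brevity, not as the framework's combination method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal view of an IR technique for UC-5 and UC-6.
interface QueryModule {
    boolean isReady();
    List<Integer> search(String query);   // document IDs, best match first
}

class SingleQueryFlow {
    private final List<QueryModule> modules = new ArrayList<>();

    void register(QueryModule module) { modules.add(module); }

    // UC-5: the system is ready only if every module is ready.
    boolean isReady() {
        for (QueryModule module : modules) {
            if (!module.isReady()) {
                return false;
            }
        }
        return true;
    }

    // UC-6: query every module, then combine the results into one list.
    List<Integer> search(String query) {
        if (!isReady()) {
            throw new IllegalStateException("system is not ready");
        }
        List<Integer> combined = new ArrayList<>();
        for (QueryModule module : modules) {
            for (Integer id : module.search(query)) {
                if (!combined.contains(id)) {
                    combined.add(id);      // keep the first occurrence only
                }
            }
        }
        return combined;
    }

    // Test helper: a module with a fixed ready status and fixed results.
    static QueryModule stub(final boolean ready, final Integer... ids) {
        return new QueryModule() {
            public boolean isReady() { return ready; }
            public List<Integer> search(String query) { return Arrays.asList(ids); }
        };
    }

    public static void main(String[] args) {
        SingleQueryFlow flow = new SingleQueryFlow();
        flow.register(stub(true, 3, 1));
        flow.register(stub(true, 1, 4));
        System.out.println(flow.search("radar tracking"));
    }
}
```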
7. Multiple Query Search


Use  case:  UC-7  Multiple  Query  Search 

Primary  Actor:  Researcher 

Stakeholders  and  Interests: 

• Researcher wants to perform multiple query searches of the corpus.

Entry  conditions: 

•  The  researcher’s  application  is  running. 

•  The  system  is  ready  as  described  in  UC-5. 

Exit  conditions: 

•  The  system  has  returned  the  results  of  the  multiple  query  search. 

Flow  of  events: 

1. Researcher submits a list of queries to the system.

2.  Each  individual  IR  technique  in  the  system  performs  a  search  for 
each  of  the  provided  queries  and  returns  results  for  each. 

3. All of the results returned from the individual IR methods are
combined to return a single set of results for each query to the
researcher.

Special  Requirements: 

1. This use case is specifically designed to allow for individual IR
methods  to  optimize  the  simultaneous  search  of  multiple  queries  in 
order  to  preserve  system  resources. 
III. SYSTEM DESIGN


A.  INTRODUCTION 

This chapter converts the general analysis model described in Chapter II into a
detailed system design. This evolution will begin with a thorough study of the use case
models, and it will continue with a decomposition of the system, as a whole, into
architectural and behavioral models that will eventually become objects in the design.

B.  SYSTEM  ARCHITECTURE 

1.  Goals 

The  primary  goal  of  the  architecture  is  modularity.  Existing  IR  techniques  can  be 
encoded as SearchModule objects and built into a Modular Search Engine application.
As new IR techniques are developed, they too can be encoded as SearchModule objects
and seamlessly inserted into the existing Modular Search Engine application for testing
and further development. As such, the SearchModule class shall be abstract, providing
an  existing  template  for  extensions  to  inherit  and  follow. 
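
A minimal sketch of such an abstract template is shown below. Only the class name SearchModule comes from the text; every method signature, and the SubstringModule extension, is an assumption made for illustration and should not be read as the framework's actual interface.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an abstract template. The class name SearchModule comes from the text;
// every method signature below is an assumption made for illustration.
abstract class SearchModule {
    private final String name;

    protected SearchModule(String name) { this.name = name; }

    String getName() { return name; }

    abstract boolean buildIndex(List<String> corpus);
    abstract boolean isReady();
    abstract List<Integer> search(String query);   // positions of matching documents
}

// Hypothetical extension: a naive technique that matches a query as a substring.
class SubstringModule extends SearchModule {
    private List<String> documents;

    SubstringModule() { super("substring"); }

    @Override
    boolean buildIndex(List<String> corpus) {
        documents = new ArrayList<>();
        for (String text : corpus) {
            documents.add(text.toLowerCase());
        }
        return true;
    }

    @Override
    boolean isReady() { return documents != null; }

    @Override
    List<Integer> search(String query) {
        List<Integer> hits = new ArrayList<>();
        String needle = query.toLowerCase();
        for (int i = 0; i < documents.size(); i++) {
            if (documents.get(i).contains(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

Code that holds a SearchModule reference can call buildIndex, isReady, and search without knowing which concrete technique is behind it, which is the polymorphism the design relies on.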

In addition to new IR techniques, new methods of conducting metasearch are
constantly being researched in the field, and the framework takes this into account as
well. It provides researchers with the ability to encode different metasearch methods as
ModuleMixer objects that can be interchanged within the system, thus keeping with the
goal  of  modularity. 

Figure 2 displays a high-level, conceptual view of the internal architecture within
the  Modular  Search  Engine  framework. 

Figure  2.  Modular  Search  Engine  Architecture 

As each SearchModule object completes a search request, it feeds its results, in
the form of a SearchResults object, into a ModuleMixer object that combines multiple
SearchResults objects into a single set of results. In general, a Modular Search Engine
implementation would only use one ModuleMixer at a time; however, this is not a
restriction. In fact, for the purposes of developmental testing and comparison, it may be
beneficial  to  implement  multiple  ModuleMixer  objects  simultaneously. 
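
The interchangeability of mixers can be sketched as follows. SearchResults and ModuleMixer are names from the text, but their contents here are assumed: results are modeled as a map from document ID to score, and the two mixers (summing scores and taking the maximum score) are common score-combination rules used only as examples.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// SearchResults and ModuleMixer are names from the text; their contents here are assumed.
class SearchResults {
    final Map<Integer, Double> scoreByDocumentId = new HashMap<>();

    SearchResults with(int documentId, double score) {
        scoreByDocumentId.put(documentId, score);
        return this;
    }
}

abstract class ModuleMixer {
    // Combines the scores a document received from two different modules.
    abstract double combine(double existing, double incoming);

    SearchResults mix(List<SearchResults> inputs) {
        SearchResults mixed = new SearchResults();
        for (SearchResults input : inputs) {
            for (Map.Entry<Integer, Double> e : input.scoreByDocumentId.entrySet()) {
                mixed.scoreByDocumentId.merge(e.getKey(), e.getValue(), this::combine);
            }
        }
        return mixed;
    }
}

// Hypothetical mixer: a document's fused score is the sum of its scores.
class SumMixer extends ModuleMixer {
    @Override
    double combine(double existing, double incoming) { return existing + incoming; }
}

// Hypothetical mixer: a document's fused score is its best single score.
class MaxMixer extends ModuleMixer {
    @Override
    double combine(double existing, double incoming) { return Math.max(existing, incoming); }
}

class MixerDemo {
    public static void main(String[] args) {
        List<SearchResults> inputs = Arrays.asList(
                new SearchResults().with(1, 0.5).with(2, 0.25),
                new SearchResults().with(2, 0.5).with(3, 0.125));
        // Running two mixers over the same inputs allows a side-by-side comparison.
        System.out.println(new SumMixer().mix(inputs).scoreByDocumentId);
        System.out.println(new MaxMixer().mix(inputs).scoreByDocumentId);
    }
}
```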

2.  Integration 

The objects within the framework will communicate with each other by directly
calling each other's procedures. However, no integration will take place between
SearchModule objects because each is specifically designed to work independently of
the others. As such, custom designed extensions of the java.lang.Thread class are used
to handle communication both to and from all SearchModule objects for the use cases
presented  in  Chapter  II. 
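
A sketch of a Thread extension in this spirit appears below. The SearchThread and ThreadedSearchDemo classes are hypothetical, and a plain list of strings stands in for each module's index; the framework's own thread classes are not shown in this chapter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Thread extension that carries one module's search to completion.
class SearchThread extends Thread {
    private final List<String> documents;   // stands in for one module's index
    private final String query;
    private final List<Integer> results = new ArrayList<>();

    SearchThread(List<String> documents, String query) {
        this.documents = documents;
        this.query = query;
    }

    @Override
    public void run() {
        for (int i = 0; i < documents.size(); i++) {
            if (documents.get(i).contains(query)) {
                results.add(i);
            }
        }
    }

    // Call only after join(), which makes this thread's writes visible to the caller.
    List<Integer> getResults() { return results; }
}

class ThreadedSearchDemo {
    static List<List<Integer>> searchAll(List<List<String>> moduleData, String query)
            throws InterruptedException {
        List<SearchThread> threads = new ArrayList<>();
        for (List<String> data : moduleData) {
            SearchThread thread = new SearchThread(data, query);
            threads.add(thread);
            thread.start();                  // modules run side by side
        }
        List<List<Integer>> all = new ArrayList<>();
        for (SearchThread thread : threads) {
            thread.join();                   // only the caller waits, never another module
            all.add(thread.getResults());
        }
        return all;
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> moduleData = Arrays.asList(
                Arrays.asList("radar", "sonar", "radar mast"),
                Arrays.asList("sonar", "radar"));
        System.out.println(searchAll(moduleData, "radar"));
    }
}
```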
C. BEHAVIORAL DESIGN


1.  Domain  Object  Model 

The  domain  object  model  records  the  key  concepts  in  the  Modular  Search  Engine 
framework. Figure 3 depicts the various entities involved and the relationships between
them.  See  Appendix  for  a  key  to  the  figure. 


2.  Sequence  Diagrams 

Sequence diagrams help formalize the dynamic behavior of the system by tying
use cases to objects and by showing how processes operate with one another and in what
order. Visualizing the communication among objects can help determine additional
objects required to formalize the use cases [6]. In this regard, sequence diagrams offer
another perspective on the behavioral model and are instrumental in discovering missing
objects and grey areas in the requirements specification. The following sequence
diagrams depict the use cases identified in Chapter II.

a.  Add  Document 


Figure 4 displays the sequence diagram for adding a document in the
Modular Search Engine framework.

Figure 4. Add Document Sequence Diagram

b. Delete Document

Figure 5 displays the sequence diagram for deleting a document in the
Modular Search Engine framework.

Figure 5. Delete Document Sequence Diagram

c. Build Index

Figure 6 displays the sequence diagram for building the necessary indices
in the Modular Search Engine framework.


d.  Force  Build  Index 

Figure  7  displays  the  sequence  diagram  for  forcibly  building  the  necessary 
indices  in  the  Modular  Search  Engine  framework. 


Figure 7. Force Build Index Sequence Diagram

e. Ready Check

Figure 8 displays the sequence diagram for determining that the system is
ready to accept a search query in the Modular Search Engine framework.

Figure 8. Is Ready Sequence Diagram

f. Single Query Search

Figure 9 displays the sequence diagram for performing a single query
search in the Modular Search Engine framework.

[Sequence diagram not reproduced. Recoverable labels: lifelines :User, :Researcher,
:ModuleMixer, and :SearchForQueryThread; messages searchFor(query), create, start(),
results, and mix(List<results>); a loop over each SearchModule.]

Figure  9.  Single  Query  Search  Sequence  Diagram 


In this case, the user is not normally responsible for redirecting the list of
results returned from the ModularSearchEngine object into the ModuleMixer object.
Instead, this is performed automatically by the user's application.

g. Multiple Query Search


Figure  10  displays  the  sequence  diagram  for  performing  a  multiple  query 
search  in  the  Modular  Search  Engine  framework. 


Figure  10.  Multiple  Query  Sequence  Diagram 


3.  Operational  Contracts 

Operational contracts represent the final phase of the behavioral model design;
they are built on the foundations established by the use case specifications, domain object
model, and sequence diagrams. These operational contracts assign concrete attributes,
such as function names, parameters, and return types, to the framework components and
also provide a brief definition of purpose to each. Additionally, the operational contracts
precisely define the pre-conditions and post-conditions required for the proposed
methods.


a.  Add  Document 

Contract: C1: Add Document

Method: addDocument(Document d)

Cross Reference: UC-1: Add Document

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

4. The system has completed a successful call to buildIndex() or
forceBuildIndex().

5. The Document object to be added was successfully constructed.

Post-conditions:

1. The ModularSearchEngine object constructed and started an
AddDocumentThread object for each SearchModule object in the
system.

2. Each SearchModule object's addDocument(Document d) method
has executed and terminated.

3. A status message was displayed back to the user.

b.  Delete  Document 

Contract: C2: Delete Document

Method: deleteDocument(int docID)

Cross Reference: UC-2: Delete Document

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

4. The system has completed a successful call to buildIndex() or
forceBuildIndex().

5. The unique identification number of the Document object to be
deleted is known.

Post-conditions:

1. The ModularSearchEngine object constructed and started a
DeleteDocumentThread object for each SearchModule object in
the system.

2. Each SearchModule object's deleteDocument(int docID) method
has executed and terminated.

3. A status message was displayed back to the user.

c.  Build  Index 

Contract: C3: Build Index

Method: buildIndex()

Cross Reference: UC-3: Build Index

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

Post-conditions:

1. The ModularSearchEngine object constructed and started a
BuildIndexThread object for each SearchModule object in the
system.

2. Each SearchModule object's buildIndex() method has executed and
terminated.

3. A status message was displayed to the user.

d. Force Build Index


Contract: C4: Force Build Index

Method: forceBuildIndex()

Cross Reference: UC-4: Force Build Index

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

Post-conditions:

1. The ModularSearchEngine object constructed and started a
ForceBuildIndexThread object for each SearchModule object in
the system.

2. Each SearchModule object's forceBuildIndex() method has
executed, terminated, and returned its success or failure.

3. A status message was displayed to the user.

e.  Ready  Check 

Contract: C5: Ready Check

Method: isReady()

Cross Reference: UC-5: Ready Check

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

4. The system has completed a successful call to buildIndex() or
forceBuildIndex().

Post-conditions:

1. The ModularSearchEngine object constructed and started an
IsReadyThread object for each SearchModule object in the system.

2. Each SearchModule object's isReady() method has executed,
terminated, and returned its ready status.

3. A status message was displayed to the user.

f. Single Query Search

Contract: C6: Single Query Search

Method: searchFor(String query, int returnSize)

Cross Reference: UC-6: Single Query Search

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

4. The system has completed a successful call to buildIndex() or
forceBuildIndex().

5. The system has completed a successful call to isReady().

6. The user's query is contained within a String object.

Post-conditions:

1. The ModularSearchEngine object constructed and started a
SearchForQueryThread object for each SearchModule object in the
system.

2. Each SearchModule object's searchFor(String query, int
returnSize) method has executed, terminated, and returned a
SearchResults object.

3. The ModularSearchEngine object collected and passed all of the
returned SearchResults objects from post-condition 1 into a
ModuleMixer object via the ModuleMixer's
mix(ArrayList<SearchResults>) method.

4. The ModuleMixer method from post-condition 3 returned a single
SearchResults object.

5. A status message was displayed to the user.

g. Multiple Query Search


Contract: C7: Multiple Query Search

Method: searchFor(Set<String> queries, int returnSize)

Cross Reference: UC-7: Multiple Query Search

Pre-conditions:

1. The Corpus object was successfully constructed.

2. All of the SearchModule objects were successfully constructed and
added to an ArrayList.

3. The ModularSearchEngine object was successfully constructed
with the Corpus object and the ArrayList of SearchModule objects
listed in pre-conditions 1 and 2 above.

4. The system has completed a successful call to buildIndex() or
forceBuildIndex().

5. The system has completed a successful call to isReady().

6. The researcher's batch of queries is contained within a Set<String>
object.

Post-conditions:

1. The ModularSearchEngine object constructed and started a
MultiSearchForQueryThread object for each SearchModule object
in the system.

2. Each SearchModule object's searchFor(Set<String> queries, int
returnSize) method has executed, terminated, and returned a
Hashtable<String,SearchResults> object.

3. The ModularSearchEngine object collected and passed all of the
returned Hashtable<String,SearchResults> objects from post-condition
1 into a ModuleMixer object via the ModuleMixer's
mix(Hashtable<String,ArrayList<SearchResults>>
tableOfListedResults) method.

4. The ModuleMixer method from post-condition 3 returned a
Hashtable<String,SearchResults> object.

5. A status message was displayed to the user.

D.  OBJECT  DESIGN 

The system analysis conducted in the previous sections for the Modular Search
Engine framework is critical for identifying the necessary objects that need to exist
within the framework and how those objects should interact with one another. This
section describes those objects in detail. See Appendix for class diagram reference.

1. Classes


This section describes the non-abstract classes in the framework, with the
exception of the Thread classes. The customized extensions of the java.lang.Thread class
are described later in this section.

a.  ModularSearchEngine 

The ModularSearchEngine class is the primary object on which all use
cases, sequence diagrams, and operational contracts focus; it is the central object in any
application developed from the framework. Figure 11 is the UML class model for the
ModularSearchEngine class.

ModularSearchEngine

-corpus : Corpus
-modules : ArrayList<SearchModule>

+addDocument(id : int, doc : Document) : boolean
+buildIndex() : boolean
+deleteDocument(id : int) : boolean
+forceBuildIndex() : boolean
+isReady() : boolean
+nextID() : Integer
+searchFor(query : String, returnSize : int) : ArrayList<SearchResults>
+searchFor(queries : Set<String>, returnSize : int) : Hashtable<String, ArrayList<SearchResults>>

Figure 11. UML ModularSearchEngine Class Model

(1)  Attributes 

Corpus corpus: This private variable is the Corpus on which the
ModularSearchEngine performs its operations.

ArrayList<SearchModule> modules: This private variable is the
container for all of the SearchModules in the system.

(2)  Methods 

boolean addDocument(Document): This public method is the
interface through which a Document is added to the system. During this method's
execution, the provided Document is first added to the Corpus via its addDoc method. If
adding the Document to the Corpus is not successful, this method prints an error, returns
false, and terminates. Otherwise, this method continues, creating and starting an
AddDocumentThread for each SearchModule in the system. Each AddDocumentThread
is responsible for calling the addDoc method of the SearchModule to which it is
assigned. As those addDoc methods terminate, each AddDocumentThread returns
whether or not its addDoc method was successful, and this method prints an appropriate
message reflecting that success or failure. Once all of the AddDocumentThreads have
terminated, if there were any failures, then this method displays an error message, returns
false, and terminates. If there were no failures, then this method displays an appropriate
message, returns true, and terminates.

boolean deleteDocument(int): This public method is the interface
through which Documents are deleted from the system; the provided integer corresponds
to the unique identification number of the document to be deleted. The indicated
Document is first deleted from the Corpus via its deleteDoc method. If deleting the
document from the Corpus is not successful, this method prints an error, returns false,
and terminates. Otherwise, this method continues, creating and starting a
DeleteDocumentThread for each SearchModule in the system. Each
DeleteDocumentThread is responsible for calling the deleteDoc method of the
SearchModule to which it is assigned. As those deleteDoc methods terminate, each
DeleteDocumentThread returns whether or not its deleteDoc method was successful, and
this method prints an appropriate message reflecting that success or failure. Once all of
the DeleteDocumentThreads have terminated, if there were any failures, this method
displays an error message, returns false, and terminates. If there were no failures, then
this method displays an appropriate message, returns true, and terminates.

boolean buildIndex(): This public method is the interface through
which a user ensures that an appropriate index is built for each SearchModule. It first
creates and starts a BuildIndexThread for each SearchModule in the system, each of
which is responsible for calling the buildIndex method of the SearchModule to which it is
assigned. As those buildIndex methods terminate, each BuildIndexThread returns
whether or not its buildIndex method was successful, and this method prints an
appropriate message reflecting that success or failure. Once all of the BuildIndexThreads
have terminated, if there were any failures, this method displays an error message, returns
false, and terminates. If there were no failures, then this method displays an appropriate
message, returns true, and terminates. This method allows each SearchModule the
opportunity to optimize its buildIndex method so that, if possible, a new index might be
built upon an existing one. This would allow the system to save resources, instead of
building a new index directly from the Corpus each time.

boolean forceBuildIndex(): This public method is the interface
through which a user forces each SearchModule to build a new index directly from the
Corpus. It first creates and starts a ForceBuildIndexThread for each SearchModule in the
system, each of which is responsible for calling the forceBuildIndex method of the
SearchModule to which it is assigned. As those forceBuildIndex methods terminate, each
ForceBuildIndexThread returns whether or not its forceBuildIndex method was
successful, and this method prints an appropriate message reflecting that success or
failure. Once all of the ForceBuildIndexThreads have terminated, if there were any
failures, this method displays an error message, returns false, and terminates. If there
were no failures, then this method displays an appropriate message, returns true, and
terminates. This method is the complement to the method above, and its primary purpose
is to be used when the user suspects that an index has become corrupted on disk.
Additionally, it may be used any time that a user has a reason to give the system a “fresh
start”; however, a call to this method can be expected to take a significant amount of time
to complete.

boolean isReady(): This public method is the interface through
which a user determines if the system is ready to receive a search query. It first creates
and starts an IsReadyThread for each SearchModule in the system, each of which is
responsible for calling the isReady method of the SearchModule to which it is assigned.
As the isReady methods terminate, each IsReadyThread returns the status of its isReady
method, and this method prints an appropriate message reflecting that status. If any of
the IsReadyThreads indicated that its SearchModule was not ready, then this method
displays an error message, returns false, and terminates. If all of the SearchModules are
ready, then this method displays an appropriate message, returns true, and terminates.

Integer nextID(): This public method is a utility to be used while
creating new Documents because each Document is required to have a unique
identification number, as shown later in this chapter. This method provides the user with
the next available integer that can be assigned to a new Document for entry into the
Corpus and each SearchModule. Specifically, it calls and returns the value from the
Corpus' protected nextID method, which is also shown later in the chapter.

ArrayList<SearchResults> searchFor(String, int): This public
method is the primary interface for conducting a search of the Corpus. The parameters to the
method are the query String and an integer that indicates the number of results to return;
e.g., if the provided integer is 100, then each SearchModule returns the top 100
Documents that match the search query. If the provided integer is greater than the
number of Documents in the Corpus, it is treated as if the user requested the results for all
Documents. This method first creates and starts a SearchForThread for each
SearchModule in the system, each of which is responsible for calling the appropriate
searchFor method of the SearchModule to which it is assigned. As those searchFor
methods terminate and return SearchResults, each SearchForThread returns those
SearchResults. All of the SearchResults are collected into an ArrayList and then returned
by this method.

Hashtable<String,ArrayList<SearchResults>>
searchFor(Set<String>, int): This public method is the primary interface that an IR
researcher uses to conduct batch query searches. This method allows researchers and
developers to take advantage of the way that a SearchModule computes the relevance of a
document and optimize it, if possible, for performing multiple search queries
simultaneously. The parameters to the method are a Set of query Strings and an integer
that indicates the number of results that should be returned in the SearchResults. This
method first creates and starts a MultiSearchForThread for each SearchModule in the
system, each of which is responsible for calling the appropriate searchFor method of the
SearchModule to which it is assigned. Those searchFor methods terminate and return a
Hashtable of SearchResults, which are indexed by the String used to produce them. Each
MultiSearchForThread returns that Hashtable accordingly, after which all of the
Hashtables are broken down to produce a single Hashtable of ArrayLists of
SearchResults such that the index of the Hashtable is the String which generated the list
of results.
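
The regrouping step at the end of the batch search, in which the per-module Hashtables are broken down into one Hashtable of lists keyed by query, might look like the sketch below. BatchRegrouper is a hypothetical helper, and the type parameter R stands in for the framework's SearchResults class so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

class BatchRegrouper {
    // Turns "one table per module, keyed by query" into
    // "one table keyed by query, holding one result set per module".
    // R stands in for the framework's SearchResults type (an assumption of this sketch).
    static <R> Hashtable<String, ArrayList<R>> regroup(List<Hashtable<String, R>> perModule) {
        Hashtable<String, ArrayList<R>> byQuery = new Hashtable<>();
        for (Hashtable<String, R> table : perModule) {
            for (Map.Entry<String, R> e : table.entrySet()) {
                // Create the list for this query on first use, then append the module's results.
                byQuery.computeIfAbsent(e.getKey(), q -> new ArrayList<>()).add(e.getValue());
            }
        }
        return byQuery;
    }
}
```

Within each list, the result sets appear in the order in which the modules' tables were supplied.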


b.  Document 

The essence of conducting a search is to find documents that are relevant
to the provided query, and as such, the Document class is the basic element in the
Modular Search Engine framework. However, the provided class implementation
represents only the minimum amount of information necessary to comprise the concept of
a document. In many cases, much more information about a given document is available,
and, as such, this Document class should be extended to include that additional
information as required. Figure 12 is the UML class model for the Document class.

Document

-body : String
-id : int

+bodyLength() : int
+getBody() : String
+getID() : int
+setBody(body : String) : void

Figure 12. UML Document Class Model

(1)  Attributes 

String  body:  This  private  variable  is  the  text  body  of  a  Document. 

int  id:  This  private  variable  is  the  unique  identification  number  of  a 
Document;  it  must  be  unique  amongst  all  the  other  Documents  in  a  given  Corpus. 

(2)  Methods 

int bodyLength(): This public method allows a user to quickly get
the length of the Document's text, without having to get the entire body of the Document.

String getBody(): This public method allows a user to get the
entire body of the Document.

int getID(): This public method allows a user to get the unique
identification number of a Document.

void setBody(String): This public method allows a user to set the
text body of a Document.
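
A minimal sketch of the class in Figure 12, together with one way it might be extended, is given below. The constructor signatures, the choice to measure length in characters, and the TitledDocument subclass with its title field are assumptions made for illustration; they are not taken from the framework.

```java
// Minimal Document following Figure 12: a fixed ID and a replaceable text body.
class Document {
    private String body;
    private final int id;

    // Assumed constructor; the class model above does not list one.
    Document(int id, String body) {
        this.id = id;
        this.body = body;
    }

    // Length is measured in characters here, which is an assumption.
    public int bodyLength() {
        return body.length();
    }

    public String getBody() {
        return body;
    }

    public int getID() {
        return id;
    }

    public void setBody(String body) {
        this.body = body;
    }
}

// Hypothetical extension that carries one additional piece of information.
class TitledDocument extends Document {
    private final String title;

    TitledDocument(int id, String title, String body) {
        super(id, body);
        this.title = title;
    }

    public String getTitle() {
        return title;
    }
}
```

Code written against Document continues to work when handed a TitledDocument, which is what allows richer document types to pass through the same interfaces.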


c.  DocScore 

Conceptually, when conducting a search, documents are considered in turn
and evaluated for how relevant they are to the provided query. The DocScore class is a
customized container class specifically created for the purpose of representing that
evaluation. Figure 13 is the UML class model for the DocScore class.

DocScore 

{implements Comparable<DocScore>, Comparator<DocScore>}

-docID : Integer
-docRank : Integer
-docScore : double

+compare(ds1  :  DocScore,  ds2  :  DocScore) :  int 

+compareTo(ds  :  DocScore) :  int 

+id() :  Integer 

+rank() :  Integer 

+score() :  Double 

#setRank(rank  :  int) :  void 

+toString() :  String 


Figure  13.  UML  DocScore  Class  Model 


(1)  Attributes 

Integer docID: This private variable is the unique identification
number of the Document to which this DocScore refers.

Integer docRank: This private variable is the rank given to the
Document.

double docScore: This private variable is the score that the
Document receives from the evaluation process.

(2) Methods

int compare(DocScore, DocScore): This public method is required
by the implementation of the java.util.Comparator interface. This method assists in the
sorting of DocScores. When two DocScores are compared with this method, it will
return a positive integer if the first has a better score (ranked higher) than the second.

int compareTo(DocScore): This public method is required by the
implementation of the java.lang.Comparable interface. This method assists in the sorting
of DocScores and functions in the same manner as described above.

Integer id(): This public method allows a user to get the unique
identification number of the Document to which this DocScore refers.

Integer rank(): This public method allows a user to get the rank
contained within the DocScore.

Double score(): This public method allows a user to get the score
contained within the DocScore.

void setRank(int): This protected method allows a user to set the
rank contained within the DocScore.

String toString(): This public method allows a user to get a String
representation of the DocScore for display purposes.
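
The comparison behavior described above can be sketched as follows. DocScoreSketch is a hypothetical, reduced version of the class: it omits the rank, assumes that a higher score is the better one, and adds a tie-break on the document ID that the text does not mention.

```java
import java.util.Comparator;

// Reduced, hypothetical version of DocScore that carries only an ID and a score.
class DocScoreSketch implements Comparable<DocScoreSketch>, Comparator<DocScoreSketch> {
    private final Integer docID;
    private final double docScore;

    DocScoreSketch(int docID, double docScore) {
        this.docID = docID;
        this.docScore = docScore;
    }

    public Integer id() {
        return docID;
    }

    public Double score() {
        return docScore;
    }

    // Returns a positive integer when the first argument has the better score.
    // Assumption: a higher score is better.
    @Override
    public int compare(DocScoreSketch a, DocScoreSketch b) {
        int byScore = Double.compare(a.docScore, b.docScore);
        // Assumed tie-break on ID so that two different documents with equal
        // scores are not treated as duplicates by a sorted set.
        return byScore != 0 ? byScore : Integer.compare(b.docID, a.docID);
    }

    @Override
    public int compareTo(DocScoreSketch other) {
        return compare(this, other);
    }

    @Override
    public String toString() {
        return "doc " + docID + " score " + docScore;
    }
}
```

Under this ordering, a sorted collection such as a TreeSet holds its best-scoring element last.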

d. SearchResults

The DocScore class above, for all practical purposes, cannot exist alone
because the information contained within a single DocScore is useless without other
DocScores to compare against. As such, the SearchResults class has been created as a
custom container class designed to hold all of the DocScores generated from a single
search query. Figure 14 is the UML class model for the SearchResults class.

SearchResults
{implements Iterable<DocScore>}

-dsVersion : int
-firstPut : boolean
-putVersion : int
-query : String
-scoreTable : Hashtable<Integer, DocScore>
-scoreTree : TreeSet<DocScore>
-weight : double
-whoMadeMe : String

-add(ds : DocScore) : boolean
+docIDs() : Set<Integer>
+get(docID : Integer) : DocScore
+getQuery() : String
+getWeight() : double
+getWhoMadeMe() : String
+iterator() : Iterator<DocScore>
+put(id : int, rank : int) : boolean
+put(id : int, score : double) : boolean
+put(id : int, score : double, rank : int) : boolean
+put(ds : DocScore) : boolean
+setQuery(query : String) : void
+setRanks() : void
+setWeight(weight : double) : void
+size() : int
+toString() : String


Figure  14.  UML  SearchResults  Class  Model 


(1)  Attributes 

int dsVersion: This private variable ensures that all of the
DocScores contained within the SearchResults are formatted the same. For example, the
user is prohibited from placing a DocScore consisting of a docID and docScore into a set
of SearchResults that already contains DocScores with docID and docRank.

boolean firstPut: This private variable is used for internal record-keeping
in conjunction with the dsVersion attribute above.

int putVersion: This private variable is used for internal record-keeping
in conjunction with the dsVersion and firstPut attributes above.

String query: This private variable is the query String that produced
this SearchResults.

Hashtable<Integer, DocScore> scoreTable: This private variable is
one of two internal containers that hold DocScores. It allows quick access to a DocScore
that is associated with a particular Document.

TreeSet<DocScore> scoreTree: This private variable is the second
internal container that holds DocScores. It allows for the quick ordered retrieval of all
the DocScores contained within because the DocScores are stored in sorted order
according to the compareTo method described above.

double weight: This private variable assigns a weight to the
SearchResults for the purpose of weighting different sets of results against one another.

String whoMadeMe: This private variable stores the unique String
name of the object that created the SearchResults. This variable is the only way that the
set of SearchResults is tied to the SearchModule or ModuleMixer that created it.
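
To illustrate how the two internal containers and the format check described above might cooperate, a minimal sketch follows. It is an assumption about one workable design, not the framework's implementation: ResultEntry and ResultsSketch are hypothetical names, the numeric format encoding is invented for the example, and entries are ordered by document ID only to keep the code short.

```java
import java.util.Hashtable;
import java.util.TreeSet;

// Hypothetical stand-in for DocScore; rank and score are null when absent.
class ResultEntry implements Comparable<ResultEntry> {
    final int docID;
    final Integer rank;
    final Double score;

    ResultEntry(int docID, Integer rank, Double score) {
        this.docID = docID;
        this.rank = rank;
        this.score = score;
    }

    // Assumed encoding of which fields are present:
    // 1 = rank only, 2 = score only, 3 = both, 0 = neither.
    int format() {
        return (rank != null ? 1 : 0) + (score != null ? 2 : 0);
    }

    // Ordered by document ID purely for brevity in this sketch.
    @Override
    public int compareTo(ResultEntry other) {
        return Integer.compare(docID, other.docID);
    }
}

class ResultsSketch {
    private int dsVersion;           // format shared by every stored entry
    private boolean firstPut = true; // true until the first entry fixes the format
    private final Hashtable<Integer, ResultEntry> scoreTable = new Hashtable<>();
    private final TreeSet<ResultEntry> scoreTree = new TreeSet<>();

    // Stores the entry in both containers, or returns false if it is rejected.
    boolean put(ResultEntry e) {
        if (e.format() == 0) {
            return false; // an entry must carry a rank, a score, or both
        }
        if (firstPut) {
            dsVersion = e.format();
            firstPut = false;
        } else if (e.format() != dsVersion) {
            return false; // mixed formats are prohibited
        }
        if (scoreTable.containsKey(e.docID)) {
            return false; // assumption: at most one entry per document
        }
        scoreTable.put(e.docID, e); // quick lookup by document ID
        scoreTree.add(e);           // ordered retrieval
        return true;
    }

    ResultEntry get(int docID) {
        return scoreTable.get(docID);
    }

    int size() {
        return scoreTree.size();
    }
}
```

The first successful put fixes the format for the lifetime of the container, so a later entry of a different shape is refused rather than silently mixed in.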

(2)  Methods 

boolean add(DocScore): This private method is a utility method
used by the put methods described below.

Set<Integer> docIDs(): This public method allows a user to get all
of the Document identification numbers contained within the SearchResults.

DocScore get(Integer): This public method allows a user to get the
DocScore for the Document whose unique identification number corresponds to the
provided integer. The null value is returned if the indicated Document does not exist in
the SearchResults.

String getQuery(): This public method allows a user to get the
String query that was used to generate the SearchResults.

double getWeight(): This public method allows a user to get the
weight of the SearchResults.

String getWhoMadeMe(): This public method allows a user to get
the name of the object that created the SearchResults.

Iterator<DocScore> iterator(): Implementing the java.lang.Iterable
interface requires the definition of this public method. Calling this method returns an
Iterator over all of the DocScores in the SearchResults. This function allows a user to
easily create a programming loop to iterate through the results via the for-each loop
construct.

boolean  putfint,  int):  This  public  method  is  one  of  four  that  allows 
a  user  to  create  an  entry  in  the  SearchResults.  The  firs  t  parameter  corresponds  to  the 
unique  identification  number  of  the  Docum  ent  to  which  the  result  pertains;  the  second 
corresponds  to  the  rank  of  that  Docum  ent  when  compared  to  the  rest  of  the  Documents. 
This  method  creates  a  D  ocScore  with  the  provided  param  eters  and  then  calls  the  private 
add  method  to  store  the  DocScore  in  the  SearchResults. 

boolean put(int, double): This public method is the second of four
that allow a user to create an entry in the SearchResults. The first parameter
corresponds to the unique identification number of the Document to which the result
pertains; the second corresponds to the score that the Document received from the
method or object that evaluated it. This method creates a DocScore with the provided
parameters and then calls the private add method to store the DocScore in the
SearchResults.

boolean put(int, double, int): This public method is the third of
four that allow a user to create an entry in the SearchResults; it is a combination of the
two put methods above. The first parameter corresponds to the unique identification
number of the Document to which the result pertains; the second corresponds to the score
that the Document received from the method or object that evaluated it; the third
corresponds to the rank of that Document when compared to the rest of the Documents.
This method creates a DocScore with the provided parameters and then calls the private
add method to store the DocScore in the SearchResults.

boolean put(DocScore): This public method is the last of four that
allow a user to create an entry in the SearchResults. The user can choose to create a
DocScore directly and then use this method, which calls the private add method to
store the DocScore in the SearchResults.

void setQuery(String): This public method allows a user to set the
query attribute that was used to create this SearchResults.

void setRanks(): This public method allows a user to
automatically set the ranks of all the DocScores contained within the SearchResults. This
method is only applicable if the DocScores do not already have assigned ranks.
DocScores are sorted according to their score attribute and assigned a rank accordingly,
such that the DocScore with the highest score is assigned a rank of one.

void setWeight(double): This public method allows a user to set
the weight attribute of the SearchResults for later use when comparing SearchResults
against one another.
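
The behavior of put, get, setRanks, and iterator described above can be illustrated with a minimal sketch. The class names SketchDocScore and SketchSearchResults are hypothetical stand-ins, not the framework's own classes, and the choices to reject a duplicate Document identification number and to iterate in rank order are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for DocScore: one entry in a set of results.
class SketchDocScore {
    final int docId;
    final double score;
    int rank; // 0 means "not yet ranked" in this sketch

    SketchDocScore(int docId, double score) {
        this.docId = docId;
        this.score = score;
    }
}

// Hypothetical stand-in for SearchResults; only a few methods are sketched.
class SketchSearchResults implements Iterable<SketchDocScore> {
    private final Map<Integer, SketchDocScore> byId = new HashMap<>();

    // Mirrors put(int, double). Assumption: a duplicate id is rejected with false.
    boolean put(int docId, double score) {
        if (byId.containsKey(docId)) {
            return false;
        }
        byId.put(docId, new SketchDocScore(docId, score));
        return true;
    }

    // Mirrors get(Integer): returns null when the id is not present.
    SketchDocScore get(int docId) {
        return byId.get(docId);
    }

    // Mirrors setRanks(): the highest score is assigned a rank of one.
    void setRanks() {
        List<SketchDocScore> sorted = new ArrayList<>(byId.values());
        sorted.sort(Comparator.comparingDouble((SketchDocScore d) -> d.score).reversed());
        int rank = 1;
        for (SketchDocScore d : sorted) {
            d.rank = rank++;
        }
    }

    // Iterates in ascending rank order, which enables the for-each loop.
    @Override
    public Iterator<SketchDocScore> iterator() {
        List<SketchDocScore> sorted = new ArrayList<>(byId.values());
        sorted.sort(Comparator.comparingInt((SketchDocScore d) -> d.rank));
        return sorted.iterator();
    }
}
```

A caller would add scored entries with put, call setRanks once, and then loop over the results with for (SketchDocScore d : results).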

2.  Abstract  Classes 

Abstract  classes  are  classes  that  cannot  be  instantiated;  they  must  be  extended  into 
a  non-abstract  child  class  in  order  to  gain  this  capability.  Below  are  the  two  abstract 
classes  in  the  Modular  Search  Engine  framework. 

a.  Corpus 

In the field of IR, a collection of documents that have similar structure is a
corpus. As such, the abstract Corpus class has been developed for the Modular Search
Engine framework. It is abstract because corpora vary greatly from one another, the
details of which this author does not presume to know. Therefore, it is up to the user to
extend this abstract class and conform it to the preexisting structure of a selected corpus.
All of the methods in the abstract Corpus class are also abstract and must be implemented
to allow the functionality described below. Figure 15 is the UML class model for the
abstract Corpus class.

Corpus
{implements Iterable<Document>}

#addDoc(id : int, doc : Document) : boolean
+clone() : Corpus
#deleteDoc(id : int) : boolean
+getDoc(id : int) : Document
+idSet() : Set<Integer>
+iterator() : Iterator<Document>
+name() : String
#nextID() : Integer
+size() : int


Figure  15.  UML  Corpus  Class  Model 


(1)  Attributes 

None. 

(2)  Methods 

boolean addDoc(Document): This protected abstract method
allows a user to add a Document to the Corpus.

Corpus clone(): This public abstract method allows a user to get a
deep copy of the Corpus.

boolean deleteDoc(int): This protected abstract method allows a
user to delete a Document from the Corpus.

Document getDoc(int): This public abstract method allows a user
to retrieve the Document whose unique identification number matches the provided
integer.

Set<Integer> idSet(): This public abstract method allows a user to
get all of the Document identification numbers contained within the Corpus.

Iterator<Document> iterator(): Implementing the
java.lang.Iterable interface requires the definition of this public method. Calling this
method returns an Iterator over all of the Documents in the Corpus. This function allows
the user to easily iterate through the Documents via the for-each loop construct.

String name(): This public abstract method allows the user to get
the name of the Corpus. Each child extended from this abstract parent class should have
a unique String returned by this function so that the Corpus can be identified at runtime.

Integer nextID(): This protected abstract method allows a user to
get the next available identification number that can be used to put a new Document into
the Corpus.

int size(): This public abstract method allows a user to get the
number of Documents in the Corpus.
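
As an illustration of how a user might extend the abstract class, the following is a minimal sketch of an in-memory child class. SketchDocument, SketchCorpus, and InMemoryCorpus are hypothetical names that only mirror the UML in Figure 15; the framework's own classes may differ, and the clone method is omitted for brevity.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the framework's Document class.
class SketchDocument {
    final String text;

    SketchDocument(String text) {
        this.text = text;
    }
}

// Hypothetical abstract parent that mirrors Figure 15 (clone omitted).
abstract class SketchCorpus implements Iterable<SketchDocument> {
    protected abstract boolean addDoc(int id, SketchDocument doc);
    protected abstract boolean deleteDoc(int id);
    public abstract SketchDocument getDoc(int id);
    public abstract Set<Integer> idSet();
    public abstract String name();
    protected abstract Integer nextID();
    public abstract int size();
}

// A child class that keeps every Document in memory, keyed by its id.
class InMemoryCorpus extends SketchCorpus {
    private final TreeMap<Integer, SketchDocument> docs = new TreeMap<>();

    @Override
    protected boolean addDoc(int id, SketchDocument doc) {
        // Assumption: adding under an id that is already in use fails.
        return docs.putIfAbsent(id, doc) == null;
    }

    @Override
    protected boolean deleteDoc(int id) {
        return docs.remove(id) != null;
    }

    @Override
    public SketchDocument getDoc(int id) {
        return docs.get(id);
    }

    @Override
    public Set<Integer> idSet() {
        return new TreeSet<>(docs.keySet()); // defensive copy
    }

    @Override
    public String name() {
        return "InMemoryCorpus";
    }

    @Override
    protected Integer nextID() {
        // One past the largest id in use, or 0 for an empty corpus.
        return docs.isEmpty() ? 0 : docs.lastKey() + 1;
    }

    @Override
    public int size() {
        return docs.size();
    }

    @Override
    public Iterator<SketchDocument> iterator() {
        return docs.values().iterator();
    }
}
```

A corpus backed by files or a database would implement the same methods against its own storage; only the child class changes.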

b.  SearchModule 

The heart of any search engine is the unique method with which it
performs its primary function: to search. The goal behind the Modular Search Engine
framework is to implement multiple different IR techniques simultaneously within a
single search engine. As such, the abstract SearchModule class is the heart of the
Modular Search Engine framework. Users are able to extend this abstract class and
implement existing and new IR techniques that will integrate seamlessly with each other
within the framework. Figure 16 is the UML class model for the abstract SearchModule
class.

SearchModule

#corpus : Corpus

+addDocument(id : int, doc : Document) : boolean
+buildIndex() : boolean
+deleteDocument(id : int) : boolean
+forceBuildIndex() : boolean
+isReady() : boolean
+name() : String
+searchFor(query : String, returnSize : int) : SearchResults
+searchFor(queries : Set<String>, returnSize : int) : Hashtable<String, SearchResults>

Figure 16. UML SearchModule Class Model

(1)  Attributes 

Corpus corpus: This protected variable is the Corpus on which the
SearchModule performs its operations.

(2) Methods

boolean addDocument(Document): This public method allows a
user to add a Document to the SearchModule.

boolean deleteDocument(int): This public method allows a user to
delete Documents from the SearchModule.

boolean buildIndex(): This public method allows the user to
ensure that an appropriate index is built for the SearchModule. It gives a
SearchModule the opportunity to optimize index construction so that, if possible, a
new index is built upon an existing one. This saves resources compared with
building a new index directly from the Corpus each time.

boolean forceBuildIndex(): This public method allows a user to
forcibly direct the SearchModule to build a new index directly from the Corpus. This
method is the complement to the method above; it is used when the user suspects that an
index has become corrupted. A call to this method can be expected to take a significant
amount of time to complete.

boolean isReady(): This public method is the interface through
which a user determines if the SearchModule is ready to receive a search query.

String name(): This public method allows the user to get the name
of the SearchModule. Each child extended from this abstract parent class should have a
unique String returned by this function so that the SearchModule can be differentiated
from other SearchModules at runtime.

SearchResults searchFor(String, int): This public method is the
primary interface for conducting a search with the SearchModule. The parameters to the
method are the query String and an integer that indicates the number of results to return;
e.g., if the provided integer is 100, then the SearchModule should return the top 100
Documents that match the search query. If the provided integer is greater than the
number of Documents in the Corpus, it is treated as if the user requested the results for all
Documents.

Hashtable<String, SearchResults> searchFor(Set<String>, int):
This public method is the primary interface through which an IR researcher conducts
batch query searches. This method allows researchers and developers to take advantage
of the way in which the SearchModule computes the relevance of a document and
optimize it, if possible, for performing multiple search queries simultaneously. The
parameters to the method are a Set of query Strings and an integer that indicates the
number of results that should be returned in each SearchResults.
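
The relationship between the two searchFor methods can be sketched as follows. This is a simplified illustration, not the framework's code: SketchSearchModule and SubstringModule are hypothetical names, a List of document identification numbers stands in for SearchResults, and the toy substring match stands in for a real IR technique. The batch method shown is the unoptimized fallback of one single-query search per query, which a real SearchModule could replace with something faster.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Set;

// Hypothetical, reduced version of the abstract SearchModule.
abstract class SketchSearchModule {
    public abstract String name();

    // Returns up to returnSize document ids (a stand-in for SearchResults).
    public abstract List<Integer> searchFor(String query, int returnSize);

    // Unoptimized batch search: one single-query search per query.
    public Hashtable<String, List<Integer>> searchFor(Set<String> queries, int returnSize) {
        Hashtable<String, List<Integer>> out = new Hashtable<>();
        for (String q : queries) {
            out.put(q, searchFor(q, returnSize));
        }
        return out;
    }
}

// Toy child class: a document matches if it contains the query as a substring.
class SubstringModule extends SketchSearchModule {
    private final List<String> docs; // the list index doubles as the document id

    SubstringModule(List<String> docs) {
        this.docs = docs;
    }

    @Override
    public String name() {
        return "SubstringModule";
    }

    @Override
    public List<Integer> searchFor(String query, int returnSize) {
        // A request larger than the corpus is capped at the corpus size.
        int limit = Math.min(returnSize, docs.size());
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size() && hits.size() < limit; id++) {
            if (docs.get(id).contains(query)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

Because the batch method is defined once in the parent, a child class only has to override it when it can share work across queries.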

3.  Interface 

Like an abstract class, an interface cannot be instantiated on its own. An interface
must be implemented by the user, and that implementation must adhere to the structure
defined in the interface. The Modular Search Engine framework contains a single
interface, detailed below.

a.  ModuleMixer 

In the field of IR, metasearch is the process of combining multiple ranked
lists of documents to produce a single list that is better than any one of the lists that
generated it. Since the Modular Search Engine framework is designed to work with
multiple IR methods simultaneously, integrating metasearch into the framework is
essential to the design. Implementing a metasearch technique is accomplished through the
ModuleMixer interface.

Figure  17  is  the  UML  model  for  the  ModuleMixer  interface. 

«interface»
ModuleMixer

+mix(listOfResults : ArrayList<SearchResults>) : SearchResults
+mix(tableOfListedResults : Hashtable<String, ArrayList<SearchResults>>) : Hashtable<String, SearchResults>

Figure  17.  UML  ModuleMixer  Interface  Model 

(1)  Attributes 
None.

(2) Methods

SearchResults mix(ArrayList<SearchResults>): This public
method is designed to accompany the single query searchFor method. It allows a user to
create a single set of SearchResults from the provided ArrayList of SearchResults via the
metasearch method implemented by the ModuleMixer.

Hashtable<String, SearchResults> mix(Hashtable<String,
ArrayList<SearchResults>>): This public method is designed to accompany the multiple
query searchFor method. It allows a user to create a single set of SearchResults for each
ArrayList of SearchResults in the provided Hashtable via the metasearch method
implemented by the ModuleMixer.
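
To make the single-list mix method concrete, the following is a minimal sketch of one possible metasearch technique: a weighted mean of ranks. The names RankedList, SketchModuleMixer, and WeightedMeanRankMixer are hypothetical, a map from document identification number to rank stands in for SearchResults, and the handling of a document that is missing from some lists (it is averaged only over the lists that contain it) is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one set of SearchResults: a weight plus a rank per document id.
class RankedList {
    final double weight; // assumed to be positive
    final Map<Integer, Integer> rankByDocId;

    RankedList(double weight, Map<Integer, Integer> rankByDocId) {
        this.weight = weight;
        this.rankByDocId = rankByDocId;
    }
}

// Hypothetical, reduced version of the ModuleMixer interface.
interface SketchModuleMixer {
    // Returns a combined value per document id; lower means better here.
    Map<Integer, Double> mix(List<RankedList> lists);
}

// Combines lists by the weighted mean of each document's ranks.
class WeightedMeanRankMixer implements SketchModuleMixer {
    @Override
    public Map<Integer, Double> mix(List<RankedList> lists) {
        Map<Integer, Double> weightedRankSum = new HashMap<>();
        Map<Integer, Double> weightSum = new HashMap<>();
        for (RankedList list : lists) {
            for (Map.Entry<Integer, Integer> e : list.rankByDocId.entrySet()) {
                weightedRankSum.merge(e.getKey(), list.weight * e.getValue(), Double::sum);
                weightSum.merge(e.getKey(), list.weight, Double::sum);
            }
        }
        Map<Integer, Double> meanRank = new HashMap<>();
        for (Map.Entry<Integer, Double> e : weightedRankSum.entrySet()) {
            meanRank.put(e.getKey(), e.getValue() / weightSum.get(e.getKey()));
        }
        return meanRank;
    }
}
```

With weights 1 and 3 and ranks 1 and 2 for the same document, the combined value is (1·1 + 3·2) / 4 = 1.75.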

4.  Threads 

The Modular Search Engine framework contains seven class extensions of the
java.lang.Thread class. Each is designed to carry out one of the use cases described in
Chapter II and is responsible for handling the communication between the
ModularSearchEngine and a SearchModule within the system. The details of all seven
are described below.


a.  AddDocumentThread 


Figure  18  is  the  UML  class  model  for  the  AddDocumentThread  class. 

AddDocumentThread
{extends Thread}

-doc : Document
-id : int
-sm : SearchModule
-success : boolean

+name() : String
+run() : void
+successful() : boolean

Figure 18. UML AddDocumentThread Class Model

(1) Attributes

Document doc: This private variable is the Document to be added.

int id: This private variable is the unique identifier of the
Document to be added.

SearchModule sm: This private variable is the SearchModule
whose addDocument method will be called by this AddDocumentThread.

boolean success: This private variable holds the returned result of
the SearchModule's addDocument method.

(2) Methods

String name(): This public method allows a user to obtain the
name of the SearchModule that this AddDocumentThread is associated with.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the addDocument method of the SearchModule
assigned to this AddDocumentThread.

boolean successful(): This public method allows a user to
determine if the Document was successfully added to the SearchModule.
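
The pattern shared by all seven threads (store the inputs, call one SearchModule method in run, and expose the result) can be sketched as below. DocumentSink and SketchAddDocumentThread are hypothetical names; DocumentSink reduces SearchModule to the two methods this thread needs, and a String stands in for Document. The addAndWait helper is not part of the framework as described; it is included only to show that the result should be read after the thread has finished.

```java
// Hypothetical reduction of SearchModule to what this thread uses.
interface DocumentSink {
    String name();
    boolean addDocument(int id, String doc);
}

// Sketch of a thread that wraps a single addDocument call.
class SketchAddDocumentThread extends Thread {
    private final DocumentSink sm;
    private final int id;
    private final String doc;
    private volatile boolean success; // written in run(), read by the caller

    SketchAddDocumentThread(DocumentSink sm, int id, String doc) {
        this.sm = sm;
        this.id = id;
        this.doc = doc;
    }

    // Name of the module this thread is associated with.
    public String name() {
        return sm.name();
    }

    @Override
    public void run() {
        success = sm.addDocument(id, doc);
    }

    // Meaningful only once the thread has finished.
    public boolean successful() {
        return success;
    }

    // Illustrative helper: start the thread, wait for it, and report the result.
    static boolean addAndWait(DocumentSink sm, int id, String doc) {
        SketchAddDocumentThread t = new SketchAddDocumentThread(sm, id, doc);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        return t.successful();
    }
}
```

In a multi-module setting, the caller would start one such thread per SearchModule, join them all, and then inspect each thread's successful and name methods.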

b.  DeleteDocumentThread 

Figure  19  is  the  UML  class  model  for  the  DeleteDocumentThread  class. 

DeleteDocumentThread
{extends Thread}

-id : int
-sm : SearchModule
-success : boolean

+name() : String
+run() : void
+successful() : boolean

Figure 19. UML DeleteDocumentThread Class Model

(1) Attributes

int id: This private variable is the unique identifier of the
Document to be deleted.

SearchModule sm: This private variable is the SearchModule
whose deleteDocument method will be called by this DeleteDocumentThread.

boolean success: This private variable holds the returned result of
the SearchModule's deleteDocument method.

(2) Methods

String name(): This public method allows a user to obtain the
name of the SearchModule that this DeleteDocumentThread is associated with.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the deleteDocument method of the
SearchModule assigned to this DeleteDocumentThread.

boolean successful(): This public method allows a user to
determine if the Document was successfully deleted from the SearchModule.

c. BuildIndexThread

Figure 20 is the UML class model for the BuildIndexThread class.

BuildIndexThread
{extends Thread}

-sm : SearchModule
-success : boolean

+name() : String
+run() : void
+successful() : boolean

Figure 20. UML BuildIndexThread Class Model

(1) Attributes

SearchModule sm: This private variable is the SearchModule
whose buildIndex method will be called by this BuildIndexThread.

boolean success: This private variable holds the returned result of
the SearchModule's buildIndex method.

(2) Methods

String name(): This public method allows a user to obtain the
name of the SearchModule that this BuildIndexThread is associated with.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the buildIndex method of the SearchModule
assigned to this BuildIndexThread.

boolean successful(): This public method allows a user to
determine if the SearchModule's buildIndex method was successful.

d. ForceBuildIndexThread

Figure 21 is the UML class model for the ForceBuildIndexThread class.

ForceBuildIndexThread
{extends Thread}

-sm : SearchModule
-success : boolean

+name() : String
+run() : void
+successful() : boolean

Figure 21. UML ForceBuildIndexThread Class Model

(1) Attributes

SearchModule sm: This private variable is the SearchModule
whose forceBuildIndex method will be called by this ForceBuildIndexThread.

boolean success: This private variable holds the returned result of
the SearchModule's forceBuildIndex method.

(2) Methods

String name(): This public method allows a user to obtain the
name of the SearchModule that this ForceBuildIndexThread is associated with.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the forceBuildIndex method of the
SearchModule assigned to this ForceBuildIndexThread.

boolean successful(): This public method allows a user to
determine if the SearchModule's forceBuildIndex method was successful.

e.  IsReadyThread 

Figure 22 is the UML class model for the IsReadyThread class.

IsReadyThread
{extends Thread}

-sm : SearchModule
-ready : boolean

+name() : String
+ready() : boolean
+run() : void

Figure 22. UML IsReadyThread Class Model

(1) Attributes

SearchModule sm: This private variable is the SearchModule
whose isReady method will be called by this IsReadyThread.

boolean ready: This private variable holds the returned result of
the SearchModule's isReady method.

(2) Methods

String name(): This public method allows a user to obtain the
name of the SearchModule that this IsReadyThread is associated with.

boolean ready(): This public method allows a user to determine if
the SearchModule is ready to receive a search query.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the isReady method of the SearchModule
assigned to this IsReadyThread.

f. SearchForQueryThread


Figure  23  is  the  UML  class  model  for  the  SearchForQueryThread  class. 

SearchForQueryThread
{extends Thread}

-query : String
-results : SearchResults
-returnSize : Integer
-sm : SearchModule

+getResults() : SearchResults
+name() : String
+run() : void

Figure 23. UML SearchForQueryThread Class Model

(1) Attributes

String query: This private variable is the String to search for and is
passed as a parameter to the SearchModule's searchFor method.

SearchResults results: This private variable holds the returned
result of the SearchModule's searchFor method.

Integer returnSize: This private variable is passed as a parameter
to the SearchModule's searchFor method to indicate the size of the SearchResults to
return.

SearchModule sm: This private variable is the SearchModule
whose searchFor method will be called by this SearchForQueryThread.

(2) Methods

SearchResults getResults(): This public method allows a user to
get the results of the search query.

String name(): This public method allows a user to obtain the
name of the SearchModule that this SearchForQueryThread is associated with.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the searchFor method of the SearchModule
assigned to this SearchForQueryThread.

g. MultiSearchForQueryThread

Figure 24 is the UML class model for the MultiSearchForQueryThread
class.

MultiSearchForQueryThread
{extends Thread}

-querys : Set<String>
-results : Hashtable<String, SearchResults>
-returnSize : Integer
-sm : SearchModule

+getResults() : Hashtable<String, SearchResults>
+name() : String
+run() : void

Figure 24. UML MultiSearchForQueryThread Class Model

(1) Attributes

Set<String> queries: This private variable is the Set of Strings to
search for and is passed as a parameter to the SearchModule's searchFor method.

Hashtable<String, SearchResults> results: This private variable
holds the returned result of the SearchModule's searchFor method.

Integer returnSize: This private variable is passed as a parameter
to the SearchModule's searchFor method to indicate the size of the SearchResults to
return.

SearchModule sm: This private variable is the SearchModule
whose searchFor method will be called by this MultiSearchForQueryThread.

(2) Methods

Hashtable<String, SearchResults> getResults(): This public
method allows a user to get the results of the batch search query.

String name(): This public method allows a user to obtain the
name of the SearchModule that this MultiSearchForQueryThread is associated with.

void run(): This public method overrides the run method of the
java.lang.Thread class. It calls the searchFor method of the SearchModule
assigned to this MultiSearchForQueryThread.

5.  Packages 

The Modular Search Engine framework is divided into three primary packages
that serve to organize the classes, interfaces, and extensions into logical groups. The
packages also serve to ensure that the protected variables are only directly accessible by
objects within the same package or by subclasses. The three packages are described below.

a.  modularSearchEngine 

The  modularSearchEngine  package  consists  of  the  following: 

•  Corpus — Abstract  Class 

•  Document — Class 

•  ModularSearchEngine — Class 

•  ModuleMixer — Interface 

b.  SearchModule 

The  SearchModule  package  consists  of  the  following: 

•  DocScore — Class 

•  SearchModule — Abstract  Class 

•  SearchResults — Class

c.  modularSearchEngineThreads

The modularSearchEngineThreads package consists of the following
seven class extensions of java.lang.Thread:

•  AddDocumentThread

•  BuildIndexThread

•  DeleteDocumentThread

•  ForceBuildIndexThread

•  IsReadyThread

•  MultiSearchForQueryThread

•  SearchForQueryThread




IV.  REFERENCE  IMPLEMENTATION 


A.  OVERVIEW 

As a proof of concept, we have developed a reference implementation to
demonstrate the abilities of the Modular Search Engine framework. This chapter
describes the internal components of the reference implementation and shows the
Graphical User Interface (GUI) designed to provide the user with a simple working
environment.

B.  EXTENSIONS  AND  IMPLEMENTATIONS 

As described in the previous chapter, several components of the Modular Search
Engine framework must be extended or implemented. Specifically, the user must extend
the abstract Corpus and SearchModule classes and implement the ModuleMixer
interface. The reference implementation contains four child classes of Corpus, two child
classes of SearchModule, and two implementation classes of ModuleMixer. These are
described below.

1.  Corpora 

The reference implementation includes four standard benchmark corpora that are
used frequently in IR [3]. The corpora were obtained from the University of Glasgow's
IR Group and are as follows: Cranfield, Medline, CISI, and Time [7]. Each of the four
Corpus classes was developed by extending the base Corpus class and adapting it to the
specifics of each data set. However, only one is active at a time, as chosen by the user.

2.  SearchModules 

There are two SearchModules included in this example application; they are
individually described below.

a. TF-IDF SearchModule

Term Frequency-Inverse Document Frequency (TF-IDF) is a basic
keyword matching technique and is the basis for one of the two SearchModules in the
reference implementation. The essentials of TF-IDF are explained below.

One way to represent a document is as a vector of the frequencies of the
words contained within it. For example, consider a document whose entirety consists of
the following sentence: "The boy fed the dog." The document is five words long, but it
only contains four unique words because the word "the" is used twice; we would say
that this document has five tokens, but only four types. We assign an index to each type
and count the number of times each appears in the document. Dividing by the sum of the
counts (the total number of words in the document) will yield the term frequency for each
type. The table below shows these values for the example.


Index   Type   Count   Term Frequency
0       the    2       2/5 = 0.4
1       boy    1       1/5 = 0.2
2       fed    1       1/5 = 0.2
3       dog    1       1/5 = 0.2

Table 1. Term Frequency Example Table
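
The counting and dividing steps behind Table 1 can be sketched in a few lines of Java. This is an illustration only: the class name TermFrequencySketch is hypothetical, and the tokenization rule (lowercase the text and split on anything that is not a letter or digit) is an assumption, since the tokenizer used by the reference implementation is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TermFrequencySketch {
    // Maps each type (unique word) to its count divided by the total token count.
    static Map<String, Double> termFrequencies(String text) {
        // Assumed tokenization: lowercase, split on non-alphanumeric characters.
        String[] tokens = text.toLowerCase().split("[^a-z0-9]+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // skip the empty string produced by leading punctuation
            }
            counts.merge(token, 1, Integer::sum);
            total++;
        }
        Map<String, Double> tf = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), e.getValue() / (double) total);
        }
        return tf;
    }
}
```

Calling termFrequencies("The boy fed the dog.") yields four entries, one per type, with "the" carrying twice the frequency of each other word.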


We can now generalize the above process. Let $c_{i,j}$ be the count of word $i$
in document $j$. We can then calculate $tf_{i,j}$, the term frequency of word $i$ in document $j$:

$$tf_{i,j} = \frac{c_{i,j}}{\sum_{k} c_{k,j}}$$

Now that we have all of the term frequencies in a document, we can
represent that document as a single column vector: $tf_j = [\, tf_{1,j}, tf_{2,j}, \ldots, tf_{V,j} \,]^T$, where $V$ is
the total number of unique words in our vocabulary.

So far, the above process weights the relevance of a word according to the
frequency with which that word appears in a document. This reflects the intuition that the
more frequent terms in a document may reflect the meaning of that document better than
the terms that appear less frequently and, thus, should have stronger weights [8, 9]. We
now turn our attention to the fact that we are dealing with multiple documents that
comprise a corpus.


Consider a word that appears in every document in the corpus. This word
has little power when trying to identify the relevance of one document over another.
Conversely, consider a word that appears in only a single document. The opposite is true
because this word carries a lot of importance in identifying this particular document when
compared to all the others. Thus, we should weight those words which are common
across many documents lower than those which appear in only a few documents [8, 9].
As such, a new measure known as the inverse document frequency (IDF) comes into
play. IDF is defined as $N / n_i$, where $N$ is the total number of documents in the corpus
and $n_i$ is the number of documents in which word $i$ appears. In order to discount the
weight of a word that appears in many documents, this measure is applied within a log
function, resulting in the following definition for the inverse document frequency of word
$i$ [9]:

$$idf_i = \log\left(\frac{N}{n_i}\right)$$

If word $i$ appears in every document, then $n_i = N$, and thus
$idf_i = \log(1) = 0$. When applied to every word in the vocabulary, this yields an IDF
vector with dimension equal to $V$.
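
The IDF definition above can be sketched directly. The class name InverseDocumentFrequencySketch is hypothetical, each document is represented as the set of its types, and the natural logarithm is an assumption, since the base of the logarithm is not stated.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InverseDocumentFrequencySketch {
    // docs: one set of distinct words (types) per document in the corpus.
    static Map<String, Double> idf(List<Set<String>> docs) {
        // n_i: the number of documents in which each word appears.
        Map<String, Integer> documentCount = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String word : doc) {
                documentCount.merge(word, 1, Integer::sum);
            }
        }
        // idf_i = log(N / n_i), using the natural logarithm (an assumption).
        int n = docs.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : documentCount.entrySet()) {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return idf;
    }
}
```

A word that appears in every document receives an IDF of zero, so it contributes nothing to any TF-IDF weight.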


Combining term frequency (TF) with IDF results in the TF-IDF weighting
scheme such that the weight of word $i$ in document $j$ is the product of its frequency in $j$
with the log of its inverse document frequency in the corpus: $w_{i,j} = tf_{i,j} \cdot idf_i$ [9]. This
yields a matrix with dimension $V \times N$, such that each column in the matrix is the TF-IDF
weight vector of a single document. We then divide each of these columns by its
Euclidean norm to produce document weight vectors whose lengths are exactly one.

The TF-IDF matrix and the IDF vector together comprise the index of the
corpus, and calculating these for a fixed corpus need only take place once. They can be
stored on disk and recalled for subsequent runs of the reference implementation. Up to
this point, all of the above calculations have been performed on the corpus, and we now
turn our attention to how to conduct a search query using TF-IDF.
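
The weighting and normalization steps that produce the index can be sketched with plain arrays. This is not the reference implementation, which is described as using a matrix library; the class name TfIdfIndexSketch is hypothetical, and the sketch stores one row per document rather than one column per document purely for convenience. Leaving an all-zero vector unchanged is also an assumption of the sketch.

```java
class TfIdfIndexSketch {
    // tf[j][i]: term frequency of word i in document j.
    // idf[i]:   inverse document frequency of word i.
    // Returns one unit-length TF-IDF weight vector per document.
    static double[][] normalizedWeights(double[][] tf, double[] idf) {
        double[][] w = new double[tf.length][idf.length];
        for (int j = 0; j < tf.length; j++) {
            double sumOfSquares = 0.0;
            for (int i = 0; i < idf.length; i++) {
                w[j][i] = tf[j][i] * idf[i]; // w_{i,j} = tf_{i,j} * idf_i
                sumOfSquares += w[j][i] * w[j][i];
            }
            double norm = Math.sqrt(sumOfSquares); // Euclidean norm
            if (norm > 0.0) {
                for (int i = 0; i < idf.length; i++) {
                    w[j][i] /= norm;
                }
            }
            // A vector of all zeros has no direction and is left as-is.
        }
        return w;
    }
}
```

Because every nonzero vector is scaled to length one, the later similarity computation reduces to a dot product.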

First, the query string is converted into a TF vector in the same manner as
each document is above. We then calculate the element-wise product of the TF vector
and the corpus' IDF vector to produce a new TF-IDF vector for the query. This vector is
normalized via the Euclidean norm, and now can be used to determine how relevant each
document in the corpus is to the provided query. The TF-IDF SearchModule
accomplishes this by computing the cosine similarity (via the dot product of normalized
vectors) between the query TF-IDF vector and the TF-IDF vector for each document in
the corpus (i.e., the columns of the matrix). This is accomplished by a single matrix
multiplication: transpose the query TF-IDF column vector into a row vector and multiply
it by the TF-IDF matrix of the corpus. The resulting vector contains the scalar cosine
similarity measure between each document in the corpus and the provided query. Sorting
in descending order according to this measure will yield an ordered list of documents
such that the most similar documents are at the top of the list [8-10].
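
The scoring and sorting steps can be sketched as follows, again with plain arrays rather than a matrix library. The class name CosineRankingSketch is hypothetical, and both the document vectors and the query vector are assumed to be already normalized to unit length, so that the dot product equals the cosine similarity.

```java
import java.util.Arrays;

class CosineRankingSketch {
    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // unitDocs[j]: unit-length TF-IDF vector of document j.
    // unitQuery:   unit-length TF-IDF vector of the query.
    // Returns document indices ordered from most to least similar.
    static Integer[] rank(double[][] unitDocs, double[] unitQuery) {
        double[] similarity = new double[unitDocs.length];
        Integer[] order = new Integer[unitDocs.length];
        for (int j = 0; j < unitDocs.length; j++) {
            similarity[j] = dot(unitDocs[j], unitQuery); // cosine for unit vectors
            order[j] = j;
        }
        // Sort document indices by descending similarity.
        Arrays.sort(order, (a, b) -> Double.compare(similarity[b], similarity[a]));
        return order;
    }
}
```

Computing all of the dot products in one pass is the array equivalent of the single matrix multiplication described above.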

It should be noted that the vector and matrix mathematics in this
implementation of TF-IDF are performed with the Colt Project, a set of open source
Java libraries published by the European Organization for Nuclear Research (CERN)
[11].


b.  Draeger’s  LDA  SearchModule 

As mentioned in Chapter II, Draeger used the Modular Search Engine
framework to implement a new IR technique to conduct semantic search. During the
course of his research, he developed a SearchModule based upon Latent Dirichlet
Allocation (LDA) [3].

LDA is a parametric Bayesian model that generates a probability
distribution over the topics covered in a document, and each topic is a distribution over
the words in a vocabulary. These topics form a latent feature set that describes a
document collection better than the words alone. Using this model, it is possible to
perform a search by using the words in the query to infer the most likely topics associated
with that query and then find the documents that cover these same topics [3, 12].

As a demonstration of the modularity of the Modular Search Engine
framework, we have taken Draeger’s LDA SearchModule and incorporated it directly
into the reference implementation.

3.  ModuleMixers 

Two ModuleMixers are included in the reference implementation; however, only
one ModuleMixer is active for each search, as chosen by the user. The details of each
ModuleMixer are described below.

a.  Weighted  Average  Rank  ModuleMixer. 

This ModuleMixer simply calculates the weighted mean rank for each
Document (via a DocScore). For a given document, it uses the weights assigned to each
set of SearchResults and computes the weighted mean rank of that document. It then
creates a new set of SearchResults whose DocScores are sorted by the new weighted
average rank. This set of SearchResults is then returned to the user.
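
As a rough sketch of the weighted mean rank calculation, the Java below fuses per-module rankings into one ordering. It does not use the framework's SearchResults or DocScore classes; the class name, the map-based representation of a ranking (document ID to rank, 1 being best), and the assumption that every module ranks every document are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative weighted-mean-rank fusion; names and data layout are hypothetical. */
class WeightedRankSketch {

    /**
     * ranksPerModule.get(i) maps document ID to its rank (1 = best) in module i,
     * and weights[i] is the mixing weight of module i. Every module is assumed
     * to rank every document.
     */
    static Map<Integer, Double> weightedMeanRanks(
            List<Map<Integer, Integer>> ranksPerModule, double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        Map<Integer, Double> mean = new HashMap<>();
        for (int i = 0; i < ranksPerModule.size(); i++) {
            for (Map.Entry<Integer, Integer> e : ranksPerModule.get(i).entrySet()) {
                // Accumulate this module's share of the weighted mean.
                mean.merge(e.getKey(), weights[i] * e.getValue() / total, Double::sum);
            }
        }
        return mean;
    }

    /** Returns document IDs ordered by ascending weighted mean rank (ties broken by ID). */
    static List<Integer> fuse(List<Map<Integer, Integer>> ranksPerModule, double[] weights) {
        Map<Integer, Double> mean = weightedMeanRanks(ranksPerModule, weights);
        List<Integer> docs = new ArrayList<>(mean.keySet());
        docs.sort((a, b) -> {
            int c = Double.compare(mean.get(a), mean.get(b));
            return c != 0 ? c : Integer.compare(a, b);
        });
        return docs;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> ranks = List.of(
                Map.of(10, 1, 20, 2, 30, 3),   // first module's ranking
                Map.of(10, 3, 20, 1, 30, 2));  // second module's ranking
        System.out.println(fuse(ranks, new double[] {0.75, 0.25}));
    }
}
```

With weights of 0.75 and 0.25, document 10 receives a mean rank of 0.75 × 1 + 0.25 × 3 = 1.5 and therefore stays ahead of document 20 (1.75), even though the second module preferred document 20.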

b.  Condorcet  Fuse  ModuleMixer. 

This ModuleMixer implements the metasearch technique known as
Condorcet-fuse [13]. The inspiration for this technique comes from the field of Social
Choice Theory, which studies voting algorithms as techniques to make group decisions
[14-16]. The Condorcet voting algorithm specifies that the winner of an election is the
candidate that beats or ties with every other candidate in a pair-wise comparison [13, 17].
Consider a voting scenario in which ten voters are voting on five candidates in an
election, and the voters must rank all five candidates in order of preference. Table 2
depicts one possible outcome of the votes for this scenario [13].

Number of Votes    Candidate Preference (in order)
3                  a, b, c, d, e
3                  e, b, c, a, d
2                  c, b, a, d, e
2                  c, d, b, a, e

Table 2. Example Voting Scenario


In the example, consider a pair-wise comparison of candidates b and c; six
out of the ten voters placed candidate b ahead of candidate c. In fact, candidate b ranks
above every other candidate in a pair-wise, head-to-head comparison; therefore,
candidate b is the Condorcet winner [13].
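
The pair-wise tallies for Table 2 can be checked mechanically. The following Java sketch encodes each ballot as a string of candidate letters (most preferred first) together with the number of voters who cast it; this encoding and the class and method names are assumptions made for illustration only.

```java
/** Illustrative Condorcet winner check for the Table 2 ballots; names are hypothetical. */
class CondorcetWinnerSketch {

    /** Counts the voters who rank candidate x ahead of candidate y. */
    static int prefer(String[] ballots, int[] counts, char x, char y) {
        int votes = 0;
        for (int i = 0; i < ballots.length; i++) {
            if (ballots[i].indexOf(x) < ballots[i].indexOf(y)) {
                votes += counts[i];
            }
        }
        return votes;
    }

    /** Returns the candidate that beats or ties every other candidate, or '?' if none exists. */
    static char winner(String[] ballots, int[] counts, String candidates) {
        int voters = 0;
        for (int c : counts) {
            voters += c;
        }
        for (char x : candidates.toCharArray()) {
            boolean beatsOrTiesAll = true;
            for (char y : candidates.toCharArray()) {
                // x loses to y when fewer than half of the voters prefer x.
                if (x != y && 2 * prefer(ballots, counts, x, y) < voters) {
                    beatsOrTiesAll = false;
                    break;
                }
            }
            if (beatsOrTiesAll) {
                return x;
            }
        }
        return '?';
    }

    public static void main(String[] args) {
        String[] ballots = {"abcde", "ebcad", "cbade", "cdbae"};
        int[] counts = {3, 3, 2, 2};
        System.out.println("b over c: " + prefer(ballots, counts, 'b', 'c'));
        System.out.println("winner: " + winner(ballots, counts, "abcde"));
    }
}
```

For these ballots, candidate b is preferred to a by 7 voters, to c by 6, to d by 8, and to e by 7, which is why b wins every head-to-head comparison.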

This is the essence of the Condorcet-fuse metasearch method and the
associated ModuleMixer in the reference implementation. Candidates are analogous to
Documents, voters to SearchModules, and vote preference to SearchResults. The
following two pseudo-code algorithms explain exactly how the Condorcet-fuse
metasearch method is applied within the Modular Search Engine framework [13].


Algorithm 1: Pair-wise Document Comparison (d1, d2)

1: count = 0
2: for each SearchModule, sm, do
2a:    If sm ranks d1 above d2, count++
2b:    If sm ranks d2 above d1, count--
3: If count > 0, rank d1 better than d2
4: Otherwise rank d2 better than d1


Algorithm 2: Condorcet-fuse

1: Create a list L of all the documents
2: Sort(L) using Algorithm 1 as the comparison function
3: Output the sorted list of documents as a SearchResults object
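
The two algorithms above can be sketched in Java as follows. This is an illustration under stated assumptions, not the framework's CondorcetFuseModuleMixer: rankings are represented as maps from document ID to rank (1 being best), every module is assumed to rank every document, and all class and method names are hypothetical. A hand-written insertion sort is used instead of List.sort because the majority comparison of Algorithm 1 is not guaranteed to be transitive, and Java's built-in sort may reject a comparator that violates its contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative Condorcet-fuse over rank maps; names and data layout are hypothetical. */
class CondorcetFuseSketch {

    /** Algorithm 1: true when a strict majority of modules ranks d1 above d2. */
    static boolean beats(List<Map<Integer, Integer>> ranksPerModule, int d1, int d2) {
        int count = 0;
        for (Map<Integer, Integer> sm : ranksPerModule) {
            int r1 = sm.get(d1);
            int r2 = sm.get(d2);
            if (r1 < r2) {
                count++;          // this module ranks d1 above d2
            } else if (r2 < r1) {
                count--;          // this module ranks d2 above d1
            }
        }
        return count > 0;
    }

    /** Algorithm 2: orders the documents using Algorithm 1 as the comparison. */
    static List<Integer> fuse(List<Map<Integer, Integer>> ranksPerModule, List<Integer> docs) {
        List<Integer> sorted = new ArrayList<>();
        for (int d : docs) {
            int pos = sorted.size();
            // Move d ahead of each trailing document that it beats.
            while (pos > 0 && beats(ranksPerModule, d, sorted.get(pos - 1))) {
                pos--;
            }
            sorted.add(pos, d);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> ranks = List.of(
                Map.of(1, 1, 2, 2, 3, 3),
                Map.of(1, 2, 2, 1, 3, 3),
                Map.of(1, 1, 2, 3, 3, 2));
        System.out.println(fuse(ranks, List.of(3, 2, 1)));
    }
}
```

In the small example in main, document 1 is preferred to document 2 by two of the three modules and to document 3 by all three, so it is placed first in the fused list.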

C. GRAPHICAL USER INTERFACE


1.  Overview 


The reference implementation GUI can be divided into five different sections: Query
Entry, Corpus Selection, ModuleMixer Selection, Status Display, and Results Display.
Figure 25 is a screenshot of the reference implementation GUI that identifies the five
basic sections; each section is described in detail below the figure.


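[Figure 25: screenshot of the reference implementation GUI, with callouts labeling the
Query Entry, Status Display, Results Display, Corpus Selection, and ModuleMixer
Selection sections]

Figure 25. GUI Overview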


2.  Sections 

a.  Query  Entry  Section 

As Figure 26 indicates, users enter their search query into the text box;
pressing <ENTER> or clicking the Search button will begin the search.

Figure 26. Query Entry Section

b.  Corpus  Selection  Section 

As previously mentioned, the reference implementation contains four
different corpora to choose from. The Corpus Selection Section allows users to choose a
corpus via radio button as shown in Figure 27. By default, the Cranfield corpus is
selected when the application is launched.

[Figure 27: "Select Corpus" panel with radio buttons for Cranfield (selected), Cisi,
Medline, and Time]

Figure  27.  Corpus  Selection  Section 

c.  ModuleMixer  Selection  Section 

Similar to the Corpus Selection Section above, the user chooses one of
two available ModuleMixers via radio button; in the reference implementation the
WeightedModuleMixer is selected by default. This ModuleMixer requires additional
input from the user via the slider bar. Moving the slider bar adjusts the relative mixing
weight assigned to each SearchModule. In Figure 28, the TF-IDF based SearchModule
is weighted three times as heavily as the other (75% versus 25%).

[Figure 28: "Select Module Mixer" panel with the Weighted Module Mixer radio button
selected; the "Set Mixing Weights" slider is set to 75% for the Vector Space TF*IDF
Module and 25% for the Draeger LDA Module]

Figure  28.  ModuleMixer  Selection  Section  with  Weighted  Module  Mixer  Selected 

If the CondorcetFuseModuleMixer is selected, the mixing weights are no
longer applicable and that sub-section is disabled accordingly, as depicted in Figure 29.

[Figure 29: "Select Module Mixer" panel with the Condorcet Fuse Module Mixer radio
button selected and the "Set Mixing Weights" sub-section disabled]

Figure 29. ModuleMixer Selection Section with Condorcet Fuse Module Mixer Selected


d.  Status  Display  Section 

When the reference implementation is running, System.out and System.err
are redirected to the Status Display as shown in Figure 30 below. This area is scrollable
so that a user can view older messages which may have scrolled up and out of view or
longer messages that extend to the right of the view.
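
One way such a redirection can be done in Java is with System.setOut and System.setErr. The sketch below is an assumption about the general approach, not the reference implementation's actual code: the class names are hypothetical, and the destination is modeled as a Consumer<String> so the example runs without a GUI. In a Swing application the consumer might append each line to a text area on the event dispatch thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.util.function.Consumer;

/** Illustrative redirection of System.out and System.err; names are hypothetical. */
class StatusRedirectSketch {

    /** Collects bytes until a newline arrives, then hands the completed line to the sink. */
    static final class LineSink extends OutputStream {
        private final Consumer<String> sink;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        LineSink(Consumer<String> sink) {
            this.sink = sink;
        }

        @Override
        public synchronized void write(int b) {
            if (b == '\n') {
                sink.accept(new String(buffer.toByteArray(), Charset.defaultCharset()));
                buffer.reset();
            } else if (b != '\r') {
                buffer.write(b);   // ignore carriage returns, keep everything else
            }
        }
    }

    /** Sends everything printed to System.out and System.err to the given sink. */
    static void redirect(Consumer<String> sink) {
        PrintStream stream = new PrintStream(new LineSink(sink), true);
        System.setOut(stream);
        System.setErr(stream);
    }

    public static void main(String[] args) {
        PrintStream console = System.out;   // keep the original stream for the demo sink
        redirect(line -> console.println("[status] " + line));
        System.out.println("Index loaded");
        System.err.println("Warning: example message");
    }
}
```

Keeping a reference to the original stream before redirecting, as main does, also makes it possible to restore normal console output later.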




Figure  30.  Status  Display  Section 


e.  Results  Display  Section 

As the name suggests, the results of the search query are displayed in this
section. In this example application, this area is simply populated with text using the
toString() method of the final SearchResults object produced by the selected
ModuleMixer. Figure 31 is an example of what this section looks like after conducting a
search. Users can use the scroll bars to view the entire set of results.


Search Results

whoMadeMe: WeightedAvgRankModuleMixer
weight: 1.0
query: testing cranfield search
DocID: 879 Rank: 1 Score: 0.5714285714285714
DocID: 724 Rank: 2 Score: 0.26666666666666666
DocID: 880 Rank: 3 Score: 0.26666666666666666
DocID: 237 Rank: 4 Score: 0.1111111111111111
DocID: 876 Rank: 5 Score: 0.10526315789473684
DocID: 1336 Rank: 6 Score: 0.0851063829787234
DocID: 486 Rank: 7 Score: 0.07692307692307693
DocID: 214 Rank: 8 Score: 0.07272727272727272
DocID: 742 Rank: 9 Score: 0.0625
DocID: 1098 Rank: 10 Score: 0.056338028169014086
DocID: 1335 Rank: 11 Score: 0.05405405405405406
DocID: 739 Rank: 12 Score: 0.04938271604938271
DocID: 640 Rank: 13 Score: 0.041237113402061855
DocID: 1134 Rank: 14 Score: 0.038834951456310676
DocID: 649 Rank: 15 Score: 0.037037037037037035
DocID: 767 Rank: 16 Score: 0.03508771929824561
DocID: 1069 Rank: 17 Score: 0.03361344537815126
DocID: 1156 Rank: 18 Score: 0.03361344537815126
DocID: 1317 Rank: 19 Score: 0.032520325203252036
DocID: 1270 Rank: 20 Score: 0.03225806451612903
DocID: 315 Rank: 21 Score: 0.031746031746031744

Figure 31. Results Display Section

D. PERFORMANCE EVALUATION


This section presents how the Modular Search Engine framework can help
students and researchers design new IR techniques and metasearch methods by
calculating and evaluating the performance of the different components within the
reference implementation.

1.  Average  Precision 


a.  Definition 


For a particular query, we use average precision as a metric to measure the
performance of an IR technique or a metasearch method [18]. The average precision for
a single query is defined as

$$AP = \frac{1}{R} \sum_{n=1}^{D} AP_n$$

where R is the total number of relevant documents and D denotes the total number of
documents in the corpus. The contribution of document $d_n$ (the document at rank n in
the results) to the average precision, $AP_n$, is defined as

$$AP_n = \frac{1}{n} \sum_{m=1}^{n} \delta_{m,n}$$

where $\delta_{m,n} = 1$ if the documents $d_n$ and $d_m$ are both relevant to the query, and
$\delta_{m,n} = 0$ otherwise.
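
The definition above translates directly into a short routine. The following Java sketch assumes the ranked results are given as an array of relevance flags, one per rank; the class name, method name, and this input representation are assumptions made for illustration.

```java
/** Illustrative average precision over a ranked list of relevance flags; names are hypothetical. */
class AveragePrecisionSketch {

    /**
     * relevantAtRank[n - 1] is true when the document at rank n is relevant to the query;
     * totalRelevant is R, the number of relevant documents for the query.
     */
    static double averagePrecision(boolean[] relevantAtRank, int totalRelevant) {
        if (totalRelevant == 0) {
            return 0.0;
        }
        double sum = 0.0;
        int relevantSoFar = 0;
        for (int n = 1; n <= relevantAtRank.length; n++) {
            if (relevantAtRank[n - 1]) {
                relevantSoFar++;
                sum += (double) relevantSoFar / n;   // AP_n for a relevant document
            }
            // AP_n is zero when the document at rank n is not relevant.
        }
        return sum / totalRelevant;
    }

    public static void main(String[] args) {
        boolean[] ranking = {true, false, true, false};   // relevant documents at ranks 1 and 3
        System.out.println(averagePrecision(ranking, 2));
    }
}
```

For the ranking in main, the relevant documents at ranks 1 and 3 contribute 1/1 and 2/3, so the average precision is (1 + 2/3) / 2 = 5/6.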


b.  Example 

Each corpus included in the reference implementation comes with a set of
test queries and a relevancy list that identifies which documents in the corpus are relevant
to each test query. These are provided so that different IR and/or metasearch techniques
can be compared with one another. For example, the 224th test query for the Cranfield
corpus is: “in practice, how close to reality are the assumptions that the flow in a
hypersonic shock tube using nitrogen is non-viscous and in thermodynamic equilibrium.”
There are exactly nine documents identified as relevant to this query.

Using the reference implementation, one can see how each SearchModule
performs compared with the other and how the ModuleMixers affect that performance
when searching for this test query. Table 3 is a summary of how the two SearchModules
performed independently and when mixed with the Condorcet-fuse ModuleMixer.


Relevant       LDA        TF-IDF     CondorcetFuse
Document ID    Ranking    Ranking    Ranking
656            6          15         7
1157           40         10         24
1274           113        32         43
1286           4          3          2
1313           15         23         11
1316           120        27         41
1317           26         61         15
1318           7          117        22
1319           100        33         33

Table 3. Relevant Document Rankings for the 224th Cranfield Test Query


With the information in Table 3, we can calculate the average precision
for each of the three sets of results. Table 4 displays the average precision calculations
for the results of Draeger’s LDA SearchModule.

nth Relevant    Relevant       LDA        APn
Document        Document ID    Ranking
1               1286           4          1/4 = 0.25
2               656            6          2/6 = 0.33333
3               1318           7          3/7 = 0.42857
4               1313           15         4/15 = 0.26667
5               1317           26         5/26 = 0.19231
6               1157           40         6/40 = 0.15
7               1319           100        7/100 = 0.07
8               1274           113        8/113 = 0.0708
9               1316           120        9/120 = 0.075

Average Precision = 0.20408

Table 4. Average Precision of Draeger’s LDA SearchModule


Table 5 displays the average precision calculations for the results of the
TF-IDF SearchModule.


nth Relevant    Relevant       TF-IDF     APn
Document        Document ID    Ranking
1               1286           3          1/3 = 0.33333
2               1157           10         2/10 = 0.2
3               656            15         3/15 = 0.2
4               1313           23         4/23 = 0.17391
5               1316           27         5/27 = 0.18519
6               1274           32         6/32 = 0.1875
7               1319           33         7/33 = 0.21212
8               1317           61         8/61 = 0.13115
9               1318           117        9/117 = 0.07692

Average Precision = 0.1889

Table 5. Average Precision of the TF-IDF SearchModule


Table 6 displays the average precision calculations for the results of the
Condorcet-fuse ModuleMixer. Note that the average precision of the mixed results for
this query is higher than that of both Draeger’s LDA SearchModule and the TF-IDF
SearchModule.

nth Relevant    Relevant       CondorcetFuse    APn
Document        Document ID    Ranking
1               1286           2                1/2 = 0.5
2               656            7                2/7 = 0.28571
3               1313           11               3/11 = 0.27273
4               1317           15               4/15 = 0.26667
5               1318           22               5/22 = 0.22727
6               1157           24               6/24 = 0.25
7               1319           33               7/33 = 0.21212
8               1316           41               8/41 = 0.19512
9               1274           43               9/43 = 0.2093

Average Precision = 0.26877

Table 6. Average Precision of the CondorcetFuse ModuleMixer
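
The three average precision values can be reproduced from the rankings in Table 3 alone. The Java sketch below takes only the ranks of the relevant documents as input, which is valid here because all nine relevant documents appear in each result list; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative recomputation of the Table 4-6 averages from Table 3; names are hypothetical. */
class Table3Check {

    /** Average precision from the ranks of the relevant documents (all assumed retrieved). */
    static double apFromRanks(int[] ranksOfRelevantDocs) {
        int[] sorted = ranksOfRelevantDocs.clone();
        Arrays.sort(sorted);                       // best-ranked relevant document first
        double sum = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            sum += (i + 1.0) / sorted[i];          // (i+1)-th relevant document at rank sorted[i]
        }
        return sum / sorted.length;
    }

    public static void main(String[] args) {
        // Columns of Table 3, in document ID order 656, 1157, 1274, ..., 1319.
        int[] lda = {6, 40, 113, 4, 15, 120, 26, 7, 100};
        int[] tfIdf = {15, 10, 32, 3, 23, 27, 61, 117, 33};
        int[] condorcetFuse = {7, 24, 43, 2, 11, 41, 15, 22, 33};
        System.out.printf("LDA: %.5f%n", apFromRanks(lda));
        System.out.printf("TF-IDF: %.5f%n", apFromRanks(tfIdf));
        System.out.printf("CondorcetFuse: %.5f%n", apFromRanks(condorcetFuse));
    }
}
```

Sorting the ranks first means the input can be given in any order, such as the document ID order used in Table 3.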


2.  Mean  Average  Precision 

a.  Definition 

In order to measure the overall performance of an IR technique or
metasearch method, we use the mean average precision. Calculating the mean average
precision is as simple as calculating the average precision, as shown above, for each
query in the set of test queries and then taking the mean of all those values.
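
Written as a formula, with Q denoting the number of test queries and $AP^{(q)}$ the average precision of the q-th query (both symbols are introduced here for illustration and do not appear elsewhere in this thesis), the mean average precision is

$$MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP^{(q)}$$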

b.  Example 

The Cranfield corpus contains a total of 225 test queries; using a separate
application to speed the process, we calculated the mean average precision of both
SearchModules independently and when mixed with the Condorcet-fuse ModuleMixer.
Figure 32 shows the average precision calculations for each test query, ordered from
largest to smallest for each method, and Table 7 shows the mean average precisions.
Again, the Condorcet-fuse ModuleMixer outperforms both of the independent
SearchModules.

[Figure 32: plot of the average precision of each test query, ordered from largest to
smallest, with one series each for LDA, TF-IDF, and CondorcetFuse]

Figure 32. Average Precision of Test Queries


LDA        TF-IDF     CondorcetFuse
0.32711    0.36701    0.37637

Table 7. Mean Average Precisions

V. CONCLUSIONS AND RECOMMENDATIONS


A.  RESEARCH  CONCLUSIONS 

The overarching goal of this thesis was to develop a software API offering
students and researchers a framework in which they can develop, test, and implement
new IR techniques and metasearch methods, specifically targeting the development of
new semantic search techniques.

Utilizing sound engineering practices, those user requirements were specified and
incorporated into the overall design of the Modular Search Engine framework. Through
extensive use of the Unified Modeling Language, software engineering patterns, and
object-oriented features, the Modular Search Engine framework achieved the modularity
goal that allows multiple IR techniques to work simultaneously within a single system
and allows IR techniques to be seamlessly added to and deleted from a system. In keeping
with the objectives, the addition of an IR technique requires only the extension of the
single abstract SearchModule class with its eight abstract methods. The framework also
successfully allows for the development of different metasearch methods that can be
interchanged within a system.

Furthermore, this thesis showed conclusively, using a standard metric, that the
framework can be used to judge the relative performance of each individual IR technique
and metasearch method.

B.  RECOMMENDATIONS  FOR  FUTURE  WORK 

Overall, this research successfully accomplished its objectives as defined in
Chapter I. However, several areas could benefit from further exploration, augmentation,
and improvement.

As with any new software application, the framework could greatly benefit from
extensive testing and debugging. If the Modular Search Engine framework were to
receive greater exposure to students and IR researchers, their feedback would
undoubtedly benefit the framework by providing information for patches and upgrades.

One upgrade in particular would be the development and inclusion of a set of
diagnostic tools. These tools would be able to automatically calculate the metrics to
analyze the performance of the different framework components using the benchmark
test corpora. Such tools would make it trivial for the developer to evaluate the
performance of a new IR technique or metasearch method.

Additionally, as end-user applications are developed, it is not recommended to
build them as stand-alone applications designed to run on client machines. Because of
the large demand on the computer’s resources, such applications will undoubtedly
run extremely slowly and would likely aggravate any user, especially during initialization.
Instead, the framework could be used to develop a server application, possibly web-based,
that clients could access to perform searches. This style of architecture would
provide the most responsiveness to users while preserving resources in client computers.

Finally, the framework could benefit from the incorporation of ontological
information such as that suggested for the SHARE repository [2]. Such information
could be used to develop a robust system that allows a user to refine search queries and
navigate through documents based upon the ontological relationships of the documents.

APPENDIX-UML REFERENCE KEY


This appendix contains the reference for the UML symbols used in Chapters II
and III of this thesis.

A.  FIGURE  3-UML  DOMAIN  OBJECT  MODEL 

An association with an aggregation relationship indicates that one class is a part
of another class. In this relationship the child class instance can outlive its parent class;
the existence of the child is not dependent on the existence of the parent. The
aggregation relationship is represented with a solid line drawn from the parent class to the
child class with an open diamond shape on the parent class’s end.


For  example,  a  ModularSearchEngine  object  contains  a  single  Corpus  object,  but 
the  SearchResults  object  contains  one  or  more  DocScore  objects: 


B.  FIGURES  11-24  UML  CLASS  MODELS 


Each class member and method is preceded with one of three symbols that
indicate its visibility.

UML  Visibility  types 
+  Public 
#  Protected 
-  Private 

Additionally, if any method name or class name is italicized, it indicates that the
method or the class is abstract.